A MODEL BASED APPROACH TO 
NON-UNIFORM VOWEL NORMALIZATION 


By 

S. V. Bharath Kumar 



DEPARTMENT OF ELECTRICAL ENGINEERING 

Indian Institute of Technology Kanpur 

MARCH, 2002 



A Model Based Approach To 
Non-Uniform Vowel Normalization 


A Thesis Submitted 

in Partial Fulfilment of the Requirements
for the Degree of 

Master of Technology 


by 

S. V. Bharath Kumar 



to the 

DEPARTMENT OF ELECTRICAL ENGINEERING 

INDIAN INSTITUTE OF TECHNOLOGY, KANPUR 


March, 2002 







CERTIFICATE 


This is to certify that the work contained in the thesis entitled “A Model Based
Approach To Non-Uniform Vowel Normalization”, by S. V. Bharath Kumar, has been
carried out under my supervision and that this work has not been submitted elsewhere 
for a degree. 


March, 2002 






(Dr. S. Umesh) 

Associate Professor, 

Department of Electrical Engineering, 
Indian Institute of Technology, 
Kanpur. 



Abstract 


A model based vowel normalization procedure is proposed based on our
study of the nature of relationships between formant frequencies of speakers.
Conventionally, a uniform scaling relationship between the formant frequencies of
speakers is assumed. In this thesis, we explore non-uniform scaling relationships
between formant frequencies and then perform the appropriate speaker normalization
for application in automatic speech recognition. The proposed model based vowel
normalization procedure is independent of vowel class and is derived entirely from
the Peterson & Barney and Hillenbrand et al. vowel formant databases. The
frequency-warping necessary to do non-uniform vowel normalization using the model
based procedure is similar to the log-warp function. This method has been analysed
using various cluster-discriminability measures, scatter plots and HMM-based vowel
recognizers.

In this thesis, we also make a comprehensive study of the vowel normalization
methods based on frequency dependent scaling of formant frequencies and on the
scale-invariant transformation, each of which shows that the frequency-warping
function required for normalization is a compromise between the log-warp and
mel-warp functions. Using separability measures like F-ratio and residual variance,
the proposed method is found to be superior to Nordstrom & Lindblom’s uniform
scaling method and Fant’s non-uniform normalization method. In addition, we have
also compared the vowel-recognition performance of the proposed method with the
other methods in an HMM-based recognizer. Using recognition accuracy as the
performance measure, the proposed model based method is found to provide the best
normalization for cross-gender cases.



Chapter 1
Introduction 


An automatic speech recognition (ASR) system enables a computer (or a machine) to
recognize words spoken by a person. Automatic recognition of speech by machine
has been a goal of research for more than four decades. However, in spite of the
glamour of designing an intelligent machine that can recognize the spoken word
and comprehend its meaning, and in spite of the enormous research effort spent in
trying to create such a machine, we are far from achieving the desired goal of a
machine that can understand spoken discourse on any subject by all speakers in all
environments. The performance of ASR depends on many factors such as vocabulary
size, speaker characteristics, accent, noise and channel characteristics. Hence, the
whole problem can be tackled under two broad categories: (1) robustness to speaker
variations and (2) robustness to noise and channel effects. Assuming that the ASR
system is robust to noise and channel effects, the major remaining factor that
affects its performance is the variability among speakers.

Depending on the speaker characteristics of the dataset used to train an ASR
system, there are broadly two classes: (1) Speaker Dependent (SD) and (2) Speaker
Independent (SI) systems. Speaker dependent systems are trained on speech data
collected from a single speaker, who is the sole user of the system. On the other
hand, speaker independent systems are trained on speech collected from many
different users. Typical applications of SD systems include desktop applications,
word processing, etc., while SI systems are typically used at public interfaces like
airline interface systems, telephone directory services, etc., where the speakers
are varied. While SI systems yield better recognition rates than SD systems for
speakers who are not in the training dataset, they are less accurate than adequately
trained SD systems for a given speaker who has contributed to the training dataset.
This degradation in the performance of SI systems relative to SD systems for a given
speaker is mainly due to the presence of large speaker variability in the training
set of SI systems. Hence, it may be possible to achieve performance close to that of
SD systems for a given speaker if the variability in the training set is removed or
reduced. The aim of speaker normalization techniques is to remove these speaker
specific variabilities from SI systems. These speaker variabilities can be due to
physiological differences in the speech production apparatus or to non-physiological
factors like dialect, emotions, speaking idiosyncrasies, etc.

A major source of the variability among speakers is attributed to the
physiological differences in the vocal tracts of the speakers. As an approximation,
the vocal tract is assumed to be of uniform cross-section, in which case the speaker
variability is directly related to the vocal tract length (VTL). It has been found
that VTL variation causes scaling in the spectral domain [1], since the formant
frequencies are inversely proportional to the length of the tube [2]. Many
normalization schemes, both linear scaling [1, 2, 3, 4] and non-linear scaling [5, 6]
(of formant frequencies), have been proposed which compensate for this variability by
re-scaling the frequency axis, resulting in substantial improvements in speech
recognition performance [1, 2, 3, 4, 5, 6]. However, Fant [5] and others [6, 7] have
shown that uniform/linear scaling of formant frequencies is a very crude
approximation and that the formant scaling is non-linear and phoneme dependent.

In this thesis, we have attempted to model these non-linearities in scaling as
a function of frequency alone and, unlike other methods, have decoupled them from
phoneme dependence. We have made a study of the relationships between the formant
frequencies of speakers to understand the nature of the non-linearity present between
them. Based on this study, we have developed a model for the non-linear relation
between the formant frequencies and applied it to vowel normalization. The
frequency-warping function calculated from this model is found to be close to the
log-warp function.

In addition, we have also made a comprehensive study of non-linear scaling of
formants, in which we estimate a frequency-warping function that improves on that of
[8] and is a compromise between the log-warp and mel-warp functions. We have also
obtained a similar warping function based on a modification of the scale-invariant
transformation [7]. We have used different analysis methods such as formant data
analysis, scatter plots and HMM-based vowel recognizers to compare the performance
of the proposed model based vowel normalization procedure with that of other similar
techniques.

The thesis is organized as follows. In Chapter 2, the motivation for non-linear
scaling of formants is provided while discussing Nordstrom & Lindblom’s [1] linear
scaling method, Fant’s [5] non-uniform normalization and the frequency dependent
scaling method [6, 8]. In Chapter 3, the proposed model based vowel normalization is
presented. In Chapter 4, we compare the performance of our proposed method with the
methods discussed in Chapter 2 in terms of residual variance, F-ratio and scatter
plots. In Chapter 5, we present our comprehensive study in modelling the non-linear
scale factor to obtain a frequency-warping function. In Chapter 6, these
normalization schemes are incorporated into an HMM-based vowel recognizer, and their
performance is evaluated, analysed and compared with that of other similar
techniques, using percentage accuracy as the performance measure. Finally, in
Chapter 7, using all the experimental results, conclusions are drawn about the
effectiveness of the proposed model based normalization approach in a
recognizer/classifier framework.



Chapter 2 

Vowel Normalization by 
Frequency Dependent Scaling 


One of the major factors affecting speech recognition is the speaker dependence of
the speech signal. It is well known that, because of differences in vocal tract
dimensions, two speakers may produce vowels that sound similar although they have
very different formant values, and they may also produce vowels that sound different
but have remarkably similar formant values [9]. Generally, as a first-order
approximation, the vocal tract is assumed to be a tube of uniform cross-section.
Hence, differences in length lead to differences in formant values. An average adult
male has a vocal tract length (VTL) of around 17 cm, while the average female VTL
measures around 14.5 cm. The first-order effect of the difference in VTL is a scaling
of the frequency axis, i.e., on average the formants of an average female speaker
are scaled up by 20% with respect to those of an average male speaker, with the
differences most severe in the vocal tract configurations of open vowels. Hence, it
is commonly assumed that the formant patterns of male and female speakers are related
by a pure scale factor which is inversely proportional to VTL [2, 10]. Different
normalization procedures have been proposed in the literature [1, 2, 5, 11] which
counteract the effect of varied vocal tract lengths.


2.1 Nordstrom & Lindblom Method: Simple Linear Scaling


Nordstrom & Lindblom [1] have proposed a simple normalization procedure based
on an estimate of the speaker’s average VTL in open vowels, as determined from
measurements of the third formant, F_3. In support of their procedure, they
demonstrated a substantial reduction of the male-female-child differences in the
Peterson & Barney [9] database on American English vowels. In their procedure of
uniform/linear scaling, the formant frequencies of the subject to be normalized are
simply divided by the factor

    \alpha = \frac{F_3^{sub}}{F_3^{ref}} = \frac{l_{ref}}{l_{sub}} = 1 + \frac{k}{100}        (2.1)

where k is the scale factor in percent, F_3^{sub} and F_3^{ref} are the average F_3 of
the open vowels (vowels with F_1 greater than 600 Hz) of the subject and of the
reference “male” speaker, and l_sub and l_ref are the VTLs associated with the subject
and the reference speakers respectively.
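
As an illustration of the uniform scaling in Eq. (2.1), the following Python sketch
estimates the scale factor of a subject from the average F_3 of the open vowels and
divides the subject's formants by it. The function names, the use of NumPy and the
formant values are assumptions made only for this example; they are not taken from
the Nordstrom & Lindblom procedure or from either database.

```python
import numpy as np

OPEN_VOWEL_F1_HZ = 600.0  # open vowels are those with F1 above 600 Hz


def uniform_scale_factor(subject_f1, subject_f3, reference_f3_open):
    """Ratio of the subject's average open-vowel F3 to the reference's."""
    f1 = np.asarray(subject_f1, dtype=float)
    f3 = np.asarray(subject_f3, dtype=float)
    is_open = f1 > OPEN_VOWEL_F1_HZ
    if not is_open.any():
        raise ValueError("no open vowels (F1 > 600 Hz) found for the subject")
    return f3[is_open].mean() / float(reference_f3_open)


def normalize_uniform(formants_hz, alpha):
    """Divide every formant frequency of the subject by the scale factor alpha."""
    return np.asarray(formants_hz, dtype=float) / alpha


# Hypothetical subject: three vowels, two of which are open (F1 > 600 Hz).
subject_f1 = [850.0, 860.0, 310.0]
subject_f3 = [2810.0, 2850.0, 3310.0]
reference_f3_open = 2425.0  # hypothetical average open-vowel F3 of the reference

alpha = uniform_scale_factor(subject_f1, subject_f3, reference_f3_open)
k_percent = 100.0 * (alpha - 1.0)  # the same scale factor expressed in percent
normalized_f3 = normalize_uniform(subject_f3, alpha)
print(f"alpha = {alpha:.4f}, k = {k_percent:.2f}%")
```

After normalization the subject's average open-vowel F_3 coincides with that of the
reference, which is the defining property of this uniform scheme; all other formants
are scaled by the same factor regardless of vowel or formant number.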

As mentioned before, the uniform tube is only a first-order approximation to
the vocal tract shape, and it leads to uniform scaling for normalization. In general,
however, the formant frequency locations (in the F_1-F_2-F_3 space) for vowels are
affected by three factors: the effective length of the pharyngeal-oral tract, the
location of constrictions along the tract, and the narrowness of the constrictions
[12, 13]. Simple linear scaling neglects both the location of the constrictions and
the vocal tract shape. Figure 2.1 shows that the scaling actually observed is a
function of both formant number and vowel category.


2.2 Non-Linear Scaling 

Fant [5] has suggested a non-uniform version of the simple scaling procedure, in
which the correction factor k is made a function of both formant number and vowel
category. With this non-uniform normalization, Fant showed a substantially greater
reduction in the differences between male and female speakers than that obtained
with the simple linear scaling proposed by Nordstrom & Lindblom.

Figure 2.1: Deviations from linear scaling.

Figure shows the actual values of (F_1, F_2) points for average male and average
female speakers for various vowel categories in the Hillenbrand et al. [14] database
on American English vowels. Dashed line indicates the predicted (F_1, F_2) point, with
a linear scaling α = 1.14. Variation in the distances between predicted and actual
points over vowel categories is evident. F_1 & F_2 are in Hz.


Fant calculated the reference scale factor, k_n, between the average female
and the average male (i.e. the reference speaker) for the n-th formant of each
vowel class. The interested reader is referred to [5, 8] for the study of the formant
specific weighting factors for different genders as defined by Fant. In the
non-uniform normalization procedure, Fant calculated the factor k for the average
female to be 17, with the average male being the reference speaker, using a method
slightly different from Eq. (2.1) and using 6 to 8 vowel databases of different
languages. Apart from using the k (let us redefine it as k_open) defined by
Nordstrom & Lindblom, Fant also used the k values determined from (F_2, F_3) of the
front vowel /IY/, with 0.5 weighting, using the formula defined in Eq. (2.1). Thus the
scale factor that Fant used for the non-uniform normalization procedure is given by

^ ^ /lY/ + k3 /lY/) 2) 

3 

Fant’s non-uniform normalization for any particular adult subject speaker is given
by weighting the formant and vowel specific reference scale factor, k_n, with the
ratio of the subject’s particular k to the k = 17% of the average female speaker with
respect to the average male reference speaker, i.e.

For the child speaker, Fant proposed the following non-uniform normalization scheme. 

K = (^) + (k - 24) for k > 24 (2.4) 

Eq. (2.3) and Eq. (2.4) represent the best prediction of the subject’s scale factor for 
a particular formant of a particular vowel. 

2.2.1 Experiments and Results 

Earlier experiments [8] on Fant’s non-uniform normalization method have shown
better vowel normalization with the average female as the reference speaker than with
the average male as the reference. So, in our experiments related to Fant’s approach,
the average female was chosen as the reference speaker. This selection provides a
common normalization formula for both adult and child speakers, given by

    k_n = k_{nM} \, \frac{k}{\varphi}        (2.5)

where φ is the scale factor of an average male with respect to the average female
reference speaker, calculated using Eq. (2.2). Based on Fant’s approach, φ was
calculated to be -14.65 for the Peterson & Barney (PnB) database and -12.18 for the
Hillenbrand (HiL) database. In the calculation of φ, (/AA/, /AE/, /EH/) and
(/AE/, /AW/, /EH/) were considered as open vowels for the PnB and HiL databases
respectively. The subscript M in the notation k_nM is used to emphasize that the
scaling is for the average male subject with respect to the average female speaker as
reference. Table 2.1 shows the k_nM values calculated using Fant’s method for the PnB
and HiL databases.

2.3 Frequency Dependent Scaling Method

The main motivation behind this work was the idea of applying Fant’s non-uniform
normalization procedure to reduce inter-speaker differences, thus aiding
speaker-independent speech recognition. Although Fant’s non-uniform normalization
scheme is definitely better than simple uniform scaling, it cannot be directly
applied to speaker normalization since it requires knowledge of the vowel category
and the formant number. The basic idea behind the Frequency Dependent Scaling (FDS)
method [5, 6] is to model the weighting factor k_nM as a function of frequency alone,
thus making it context independent (i.e. independent of vowel category) and formant
independent. This algorithm should do away with the need for a priori knowledge of
the vowel category, while at the same time doing better than simple linear scaling.


2.3.1 Frequency Dependent Scaling Factor, Γ_f

The weighting factor, k_nM, is a function of both formant number and vowel category.
The k_nM values shown in Table 2.1, calculated using Fant’s method, were averaged
over vowel category and formant number for the respective databases to obtain a
frequency dependent scaling factor, γ_f, which is purely a function of frequency [6].
The modelling of the weighting factor k_nM as a function of frequency alone was
essentially done by plotting k_nM for each formant number and vowel as a function
of the subject’s formant frequency. This was done for all speakers in the database,
and the averaging was done along the frequency axis over small bands of 100 Hz width,
so that γ_f represents the frequency dependent scale factor in a given 100 Hz band.
In this way a vector of frequency specific scale factors, γ_f, which are independent
of formant number and vowel category, was obtained. We denote this frequency
dependent scale factor array as Γ_f. The subscript f shows that the parameter is
frequency dependent. A plot of Γ_f is shown in Figure 2.2 for the PnB and HiL
databases, where each stem corresponds to the value of γ_f over a 100 Hz band. The
normalization scheme is given by


    k_f = \gamma_f \, \frac{k}{\varphi}        (2.6)


Formant scale factors
k_nM (%)
                    PnB                          HiL
Vowel     -k_1M   -k_2M   -k_3M        -k_1M   -k_2M   -k_3M
/AA/        n      11      12            -       -       -
/AE/       23      16      15           14      17      13
/AH/       17      15      14           18      14      11
/AO/       03      09      12            -       -       -
/AW/        -       -       -           20      14      10
/EH/       13      21      17           21      13      12
/EI/        -       -       -           11      17      12
/ER/       03      17      14           10      14      11
/IH/       11      19      16           12      14      12
/IY/       14      18      11           22      15      11
/OA/        -       -       -           12      12      13
/OO/        -       -       -           11      09      13
/UH/       07      12      16           19      17      12
/UW/       19      09      16           18      10      13

Table 2.1: Formant and vowel specific scale factors, k_nM

Here “-” denotes that the corresponding vowels are not present in the respective
databases. PnB refers to the Peterson & Barney database and HiL refers to the
Hillenbrand database.
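
The band averaging that turns formant and vowel specific scale factors into the
array Γ_f can be sketched as follows. This is a minimal illustration, not the code
used for the experiments: the function name and the sample (frequency, scale factor)
pairs are hypothetical, and the bands are assumed to be aligned to multiples of
100 Hz.

```python
import numpy as np


def band_average(frequencies_hz, scale_factors, band_width_hz=100.0):
    """Average scale factors over frequency bands of fixed width.

    Returns a dict mapping the lower edge of each occupied band (Hz)
    to the mean scale factor of the points that fall in that band.
    """
    freqs = np.asarray(frequencies_hz, dtype=float)
    factors = np.asarray(scale_factors, dtype=float)
    band_index = np.floor(freqs / band_width_hz).astype(int)
    averages = {}
    for b in np.unique(band_index):
        averages[float(b) * band_width_hz] = float(factors[band_index == b].mean())
    return averages


# Hypothetical points: a subject's formant frequency (Hz) and its scale factor (%).
freqs = [730.0, 760.0, 1090.0, 1190.0]
factors = [-14.0, -16.0, -11.0, -15.0]
gamma = band_average(freqs, factors)
print(gamma)
```

Each key of the returned dictionary plays the role of one stem in a plot such as
Figure 2.2; bands that contain no formant measurements are simply absent.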


The above normalization procedure, in its present form, is applicable only to
discrete formant patterns, as Γ_f is actually an array of weighting factors. But
state-of-the-art speech recognizers make use of continuous spectral patterns. Hence,
for the FDS method to be implemented on a recognizer/classifier, the normalization
procedure defined in Eq. (2.6) needs to be extended to continuous spectral patterns.
To make Γ_f a continuous function of frequency, a simple curve was fitted, using
TableCurve2D, to the array of weighting factors against frequency shown as a stem
plot in Figure 2.2. Since the continuous function obtained is not exact, we term it
an approximate scaling function, γ(f). The approximate scaling function along with
the stem plot is shown in Figure 2.3 for both PnB and HiL databases. Table 2.2 shows
the equations of the curvefits for the frequency dependent scale factors for the PnB
and HiL databases. Since the scale factor defined in Eq. (2.6) is frequency
dependent, replacing k defined in Eq. (2.1) by k_f, the frequency dependent scale
factor is obtained as

    \alpha(f) = 1 + \frac{k_f}{100} = 1 + \frac{\gamma(f)\,(\alpha - 1)}{\varphi}        (2.7)


Figure 2.2: Frequency dependent scale factors, Γ_f

Figure shows the scaling factors in percentage for the male group with female as the
reference speaker for (a) PnB database and (b) HiL database.



With this approach, although there is no context dependence, we still need to
explicitly calculate the scale factor for each speaker to do the normalization.
Eq. (2.7) shows that α(f) is not only frequency dependent but also speaker dependent,
because of the presence of the factor α or k. To achieve speaker normalization, one
way is to explicitly estimate the scale factor α [3, 4, 5] for each speaker and use
it to normalize that speaker with respect to some reference speaker. The other way is
to use the knowledge of the scaling function γ(f) so that a universal
frequency-warping function leading to scale invariance [7, 10] can be developed. This
will be of immense importance in deriving speaker independent robust features. This
motivated us to derive a frequency-warping function for the frequency dependent
scaling method.

Database    γ(f)
PnB         -15 -b 4.63e--^/^23.2i
HiL         -16.53 + 0.04 f^{0.59}

Table 2.2: Equations of curvefits for Γ_f.

γ(f) is the closed form equation for Γ_f. The curvefits were obtained by using the
TableCurve2D package.


2.4 Frequency-Warping Function Based on Frequency Dependent Scaling Method

Consider two speakers A and B, whose spectra are related by

    S_A(f) = S_B(g(\alpha_{AB}, f))        (2.8)

where g(α_AB, f) is some function that involves the speaker dependencies through its
first argument. Let f' = g(α_AB, f). Our aim was to determine g(.) so that
H(S_A(f)) = H(S_B(f)), where H(.) denotes some mapping from the f-domain to some other
domain, say ν. The non-linearity g(α_AB, f) has been modelled in many parametric
forms [7, 15, 16]. We modelled it as

    f' = g(\alpha, f) = \alpha(f)\, f = \alpha^{\beta(f)} f        (2.9)

where α is the subject’s scale factor with respect to a reference speaker, which is
frequency independent, and β(f) is only frequency dependent and is independent of
the speaker. β(f) captures the non-linearity in the scale factor. Eq. (2.9) can be
modified as

    \log(f') = \beta(f) \log(\alpha) + \log(f)        (2.10)

    \frac{\log(f')}{\beta(f)} = \log(\alpha) + \frac{\log(f)}{\beta(f)}        (2.11)

Define

    \nu = W(f) = \frac{\log(f)}{\beta(f)}        (2.12)

Assuming β(f') ≈ β(f), we get

    \nu' = \nu + \log(\alpha) = \nu + \text{constant shift}        (2.13)

where ν is the warped domain and W(f) is the frequency-warping function. Eq. (2.13)
shows that the spectra in the warped domain are translated versions of one another.
The magnitude of the Fourier transform of these warped spectral patterns is invariant
to translations, leading to scale invariant features of real speech signals. For the
given model, the frequency-warping function can be derived by equating the scale
factor of Eq. (2.9) with that of Eq. (2.7),

    \alpha^{\beta(f)} = \alpha(f) = 1 + \frac{k_f}{100}        (2.14)

hence,

    \beta(f) = \frac{\log\left(1 + \frac{k_f}{100}\right)}{\log(\alpha)}        (2.15)

Eq. (2.15) is valid for all values of α, as β(f) is assumed to be speaker independent.


Figure 2.3: Frequency dependent scaling factor, Γ_f, and frequency dependent scaling
function, γ(f)

Figure shows the frequency dependent scaling function, γ(f), obtained by curvefitting
Γ_f for (a) PnB database and (b) HiL database.


Figure 2.4: β(f) derived from frequency dependent scaling factors, Γ_f

β(f) was calculated using Γ_f, instead of γ(f), to avoid the curve-fitting errors.
Figure shows the plot of average β(f) for (a) PnB database and (b) HiL database.
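
The translation property in Eq. (2.13) can be checked numerically for the special
case β(f) = 1, where the warping reduces to ν = log(f) and the scaling is uniform.
The sketch below is only an illustration under that assumption: the envelope shape,
the frequency grid and all names are hypothetical and are not part of the method
described above.

```python
import numpy as np


def reference_envelope(f_hz):
    """Hypothetical smooth spectral envelope of speaker B: one bump near 1500 Hz."""
    return np.exp(-0.5 * ((np.log(f_hz) - np.log(1500.0)) / 0.15) ** 2)


alpha = 1.2                       # uniform scale factor between speakers A and B
shift_bins = 8                    # log(alpha) expressed in grid steps
step = np.log(alpha) / shift_bins
nu = np.log(200.0) + step * np.arange(160)   # uniform grid in the warped domain
f = np.exp(nu)                    # corresponding frequencies, about 200 Hz to 7.5 kHz

spectrum_b = reference_envelope(f)           # S_B(f)
spectrum_a = reference_envelope(alpha * f)   # S_A(f) = S_B(alpha * f)

# In the warped domain, A's envelope is B's envelope translated by shift_bins samples.
translated_b = np.roll(spectrum_b, -shift_bins)

# The Fourier transform magnitude is unchanged by the translation.
magnitude_a = np.abs(np.fft.fft(spectrum_a))
magnitude_b = np.abs(np.fft.fft(spectrum_b))
print(np.max(np.abs(magnitude_a - magnitude_b)))
```

The wrap-around of `np.roll` is harmless here because the envelope is negligible at
both ends of the grid. When β(f) varies with frequency, the translation holds only
to the extent that β(f') ≈ β(f).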

2.4.1 Experiments and Results 

In our experiments with the FDS method, we obtained γ(f) by fitting a simple curve
to the Γ_f values. Figure 2.3 shows that the curvefit, γ(f), is not accurate; in fact,
it is only 60% accurate. Hence, in order not to introduce curve-fitting errors, we
carried out our experiments using Γ_f itself instead of γ(f). Though our assumption
is that β(f) is only frequency dependent, β(f) will actually be different for
different values of α, which is obvious from Eq. (2.14). The values of β(f) obtained
for all speakers in the database were averaged, and an average β(f) was calculated
for both PnB and HiL databases. Since β(f) was calculated from Γ_f, it is actually
not a function of frequency but an array of factors dependent on frequency.
Figure 2.4 shows the average β(f) obtained for the PnB and HiL databases.

The frequency-warping function, W(f), defined in Eq. (2.12) is not correct,
as the assumption β(f') ≈ β(f) is not valid. Since finding the closed form solution
for the frequency-warping function is difficult, we discretized the warping function
over a set of frequency bands and calculated the warping function for each band. A
similar situation arises in Section 5.3.1, where a discrete warping function is
calculated with the warping parameters obtained in a different way. Given the warping
parameters, the methods used to compute the discrete warping function are exactly
the same. We present the method of finding the discrete warping function in full
detail in Section 5.3.1. Here, we present only the warping function that was
calculated using the method described in Section 5.3.1.

Database    W(f)
PnB         -3400.68 + 649.40 log(f)
HiL         -1350.19 + 50.03 (log(f))^2

Table 2.3: Equations of curvefits for discrete warping function, W_i(f).

W(f) is the closed form equation for the discrete warping function, W_i(f). The
curvefits were obtained by using the TableCurve2D package.

In short, the method is as follows. Let us divide the frequency axis into N
logarithmically equi-spaced regions. β(f) was discretized into β_i’s, where β_i is
the value of β in the i-th frequency region. The β_i’s were calculated by averaging
the values of β(f) that lie in the i-th frequency region. The value of N has to be
chosen so that β(f) and the β_i’s are not very different. We chose N = 10. Figure 2.5
shows the plot of β(f) and the β_i’s. The warping function for each band was
calculated as

«i(/) = ,!<•<» (2-16) 
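
The discretization of β(f) into band averages β_i can be sketched as follows. The
frequency range of the bands, the function name and the sample values are
assumptions made for illustration, since the band limits are not specified in this
section; the warping function of Eq. (2.16) itself is not reproduced here.

```python
import numpy as np


def discretize_beta(frequencies_hz, beta_values, f_min_hz, f_max_hz, n_bands=10):
    """Average beta over n_bands logarithmically equi-spaced frequency regions.

    Returns (edges, beta_i): the n_bands + 1 band edges in Hz and the mean of
    beta in each band (NaN for a band that contains no samples).
    Samples outside [f_min_hz, f_max_hz] are assigned to the nearest end band.
    """
    freqs = np.asarray(frequencies_hz, dtype=float)
    beta = np.asarray(beta_values, dtype=float)
    edges = np.geomspace(f_min_hz, f_max_hz, n_bands + 1)
    band = np.digitize(freqs, edges[1:-1])   # band index 0 .. n_bands - 1
    beta_i = np.full(n_bands, np.nan)
    for i in range(n_bands):
        in_band = band == i
        if in_band.any():
            beta_i[i] = beta[in_band].mean()
    return edges, beta_i


# Hypothetical example with two bands: 100-1000 Hz and 1000-10000 Hz.
edges, beta_i = discretize_beta([200.0, 500.0, 2000.0], [1.0, 0.8, 1.2],
                                f_min_hz=100.0, f_max_hz=10000.0, n_bands=2)
print(edges, beta_i)
```

Choosing the number of bands is a trade-off: more bands follow β(f) more closely
but leave fewer samples to average in each band.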

Figure 2.5: β(f) and its discrete version, β_i

Figure shows the average β(f) along with its discrete version, β_i, for (a) PnB
database and (b) HiL database. It can be seen that β_i provides a good representation
of β(f), which helps in deriving the discrete warping function, W_i(f), Eq. (2.16).


Figure 2.6: Warping function, W_i(f), and its closed form approximation, W(f)

Figure shows the warping function, W_i(f), and its closed form approximation, W(f),
given in Table 2.3 for (a) PnB database and (b) HiL database. The curvefits to W_i(f)
were the best fits obtained from TableCurve2D.


The closed form solution to the frequency-warping function, W(f), was obtained by
fitting a curve to the discrete warping functions, W_i(f), i = 1, 2, ..., N. Table 2.3
shows the equations of the curvefits for the discrete warping function, W_i(f), for
the PnB and HiL databases. Figure 2.6 shows the actual warping function, W_i(f), and
its curvefit, W(f), for the PnB and HiL databases. This is plotted to show how close
the curvefit is to W_i(f), as W(f) is the one that is actually required when
implementing the normalization method on continuous spectral patterns. Figure 2.7
shows the warping functions derived using the FDS method along with the log-warp and
mel-warp functions, the latter [17] being defined by

    \mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)        (2.17)

It is very interesting to note that the warping function derived using the FDS method
lies between the log-warp and mel-warp curves. W(f) follows the log-warp curve at low
frequencies (< 500 Hz) and the mel-warp curve at high frequencies (> 3500 Hz). The
warping function derived from the HiL database is closer to the mel curve at high
frequencies than the one derived from the PnB database. Log-warping corresponds to
simple linear or uniform scaling of the formant frequencies of the speakers. The
mel-warp function was applied in speech recognition not from the speaker
normalization point of view but from the psychoacoustic point of view. Since the
human ear behaves on the mel scale [18], the mel-warping function is used in speech
recognition to emulate the human ear. Our experiments with the FDS method have
revealed a frequency-warping function, derived using speech data alone, which behaves
mel-like at higher frequencies and log-like at lower frequencies and acts as a
compromise between the two in the middle region, which is the region of interest in
speech recognition.
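
To make the comparison between the two reference curves concrete, the sketch below
evaluates the mel-warp function of Eq. (2.17) and the log-warp function, and computes
how far each warped axis moves when all frequencies are scaled uniformly by a factor
α. The function names and the chosen values of α and f are illustrative assumptions.

```python
import numpy as np


def mel_warp(f_hz):
    """Mel-warp function: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def log_warp(f_hz):
    """Log-warp function: natural logarithm of the frequency."""
    return np.log(np.asarray(f_hz, dtype=float))


alpha = 1.2                                            # illustrative uniform scale factor
f = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])   # illustrative frequencies in Hz

log_shift = log_warp(alpha * f) - log_warp(f)   # equals log(alpha) at every frequency
mel_shift = mel_warp(alpha * f) - mel_warp(f)   # grows with frequency

print(log_shift)
print(mel_shift)
```

Under uniform scaling the log-warped axis shifts by the same amount, log(α), at every
frequency, which is why log-warping corresponds to uniform scaling. On the mel-warped
axis the shift is small at low frequencies and approaches a constant only at high
frequencies.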

Figure 2.7: Comparison of warping function derived from frequency dependent scaling
method with log-warp and mel-warp functions.

Figure shows the warping functions derived using the FDS method along with the
log-warp and mel-warp functions. The log-warp function corresponds to uniform scaling
of the formant frequencies of speakers. The mel-warp function is derived from
psychoacoustic studies. It is interesting to note the similarity of these warping
functions, though they are derived from entirely different studies.


2.5 Summary

The differences in vocal-tract dimensions among speakers are one of the major
factors affecting speech recognition. The first-order approximation of the vocal
tract as a uniform tube results in uniform scaling of formant frequencies.
Nordstrom & Lindblom’s and Fant’s methods of normalization were presented. A
frequency dependent scaling method was proposed, which, unlike Fant’s non-uniform
normalization method, is formant and context independent. The non-linearity of the
scaling function obtained from the frequency dependent scaling method was modelled,
and a warping function was derived from the speech data which resembles the log-warp
function at low frequencies (< 500 Hz) and the mel-warp function at high frequencies
(> 3500 Hz). This warping function stands as a compromise between the mel-warp and
log-warp functions.




Chapter 3 


Study of Relationships Between 
Formant Frequencies of Speakers 

3.1 Motivation 

For the last five decades, a lot of research has been carried out to solve the
problem of speech recognition, and a large amount of understanding has been developed
in the process, but the complete solution of the problem still remains elusive. The
problem of speech recognition is very broad, and we restricted ourselves to speaker
independent speech recognition. Inter-speaker variation is a major factor that
affects the performance of speaker independent speech recognition. The human ear is
a system that has much higher speech recognition accuracy than any modern day
recognizer. A lot of research has been carried out in the field of psychoacoustics,
leading to an understanding of the human auditory mechanism [18, 19]. One can ask a
very basic question: “Why don’t we embed the psychoacoustic properties of the ear into
the machine?” Thus, the knowledge gained in the field of psychoacoustics was employed
in the recognizers. This led to the introduction of terms such as “Mel Scale” [18]
and “Bark Scale” [20] into the parlance of speech recognition. Though the recognition
rates improved, there still remains a gap to be filled.

Human beings are able to handle the variability of speech very well, which
is not yet the case with machines. One interesting point to note here is that the
variability in the speech data (due to the physiological differences in the speech
production system) is in effect nullified by the human auditory mechanism. The
problem of making machines recognize better was attacked initially from the
psychoacoustic point of view and not from the point of view of speaker normalization.
In order to understand and parameterize the variability in speech, one might be
motivated to ask, “Given any two speakers, what is the relationship between them in
terms of speech related parameters?” The answer to this question will provide an
insight into the speech production mechanism itself. Given a set of speakers, if we
can find a universal relation relating all of them to a reference speaker, it will be
very helpful in normalizing all the speakers to a single reference speaker, thus
aiding speaker independent speech recognition.


3.2 Model Based Normalization 


We propose the following model relating the formant frequencies of the subject
speaker and the reference speaker:

$$F_{\mathcal{R}} = \alpha_{\mathcal{RS}} \left(1 + \frac{F_{\mathcal{S}}}{b}\right)^{c} \tag{3.1}$$

where $F_{\mathcal{R}}$ and $F_{\mathcal{S}}$ are the formant frequencies of the reference speaker, $\mathcal{R}$, and the subject
speaker, $\mathcal{S}$, respectively. $\alpha_{\mathcal{RS}}$, $b$ and $c$ are the parameters of the model defined in
Eq. (3.1), which are to be estimated from the speech data. In Eq. (3.1), $\alpha_{\mathcal{RS}}$
is a speaker-dependent parameter. We assume $b$ and $c$ to be independent of speaker
variability.

In our experiments with the model in Eq. (3.1), the reference speaker was
chosen to be the average female of the database. For a given subject speaker,
Eq. (3.1) was fitted between the arrays of formant frequencies of the subject and the
reference speaker. The PnB database consists of 76 speakers (33 males, 28 females and
15 children), each of them contributing two utterances for each of 10 vowels (/AA/,
/AE/, /AH/, /AO/, /EH/, /ER/, /IH/, /IY/, /UH/, /UW/). In our analysis,
each utterance was considered to be uttered by a different speaker, thus giving 152
speakers (66 males, 56 females and 30 children), each uttering 10 vowels. Each of
these 10 vowels is characterized by the $F_1$, $F_2$ and $F_3$ formant frequencies. The array of
frequencies of a given speaker will thus be of size 30. Eq. (3.1) was fitted between
two 30 × 1 frequency vectors.

The HiL database effectively consists of 98 speakers (37 males, 33 females, 13
boys and 15 girls), each of them uttering each of 12 vowels once (/AE/,
/AH/, /AW/, /EH/, /EI/, /ER/, /IH/, /IY/, /OA/, /OO/, /UH/, /UW/). These
vowels are also characterized by the $F_1$, $F_2$ and $F_3$ formant frequencies. Eq. (3.1) was
fitted between two 36 × 1 frequency vectors for the HiL database. The validity of the
model was tested for all speakers, and the average estimation error energy in fitting
the data was calculated to be less than 1.5% of the energy of the corresponding
data. Since Eq. (3.1) has 3 degrees of freedom, each speaker is characterized by 3
parameters. Thus the size of the parameter matrix of the database will be M × 3,
where M is the size of the database. Before going further, let us ask ourselves
a few questions. What is the motivation behind choosing the model in Eq. (3.1)? Is
this model valid (in the sense that there may be many models that fit the data better
than Eq. (3.1))? Let us answer the first question. The chief motivating factor
in choosing the model in Eq. (3.1) was to study whether a “mel-like” frequency-warping
function can be obtained from speech data alone. If this is the case, it
shows a certain connection between the speech production process and the hearing
mechanism. It also justifies the use of the mel-warp function in speech recognition,
not only from the psychoacoustic point of view but also from the point of view of speaker
normalization.

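Taking logarithm on both sides of Eq. (3.1), we have

$$\log(F_{\mathcal{R}}) = \log(\alpha_{\mathcal{RS}}) + c \log\left(1 + \frac{F_{\mathcal{S}}}{b}\right) \tag{3.2}$$

Define

$$\eta_{\mathcal{S}} = \log(F_{\mathcal{R}}) - \log(\alpha_{\mathcal{RS}}) \tag{3.3}$$

Thus from Eq. (3.3) we have

$$\eta_{\mathcal{S}} = c \log\left(1 + \frac{F_{\mathcal{S}}}{b}\right) \tag{3.4}$$

where $\eta_{\mathcal{S}}$ represents the formant frequencies of speaker $\mathcal{S}$ in the warped domain.
Similarly, for speaker $\mathcal{Q}$, the warped formant frequencies are given as

$$\eta_{\mathcal{Q}} = \log(F_{\mathcal{R}}) - \log(\alpha_{\mathcal{RQ}}) = c \log\left(1 + \frac{F_{\mathcal{Q}}}{b}\right) \tag{3.5}$$

Hence, based on Eq. (3.4) and Eq. (3.5), the frequency-warping function to do
speaker normalization is given by

$$\eta = c \log\left(1 + \frac{f}{b}\right) \tag{3.6}$$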


Database      b         σ_b       c         σ_c

PnB           0.7710    0.3362    0.9756    0.0575
HiL           0.7369    0.2700    0.9761    0.0448


Table 3.1: Estimates of parameters b and c for model based normalization.

The estimates of b and c were calculated by fitting Eq. (3.1) for all the speakers
of the database with average female as the reference. σ_b and σ_c are the standard
deviations of b and c respectively.


Eq. (3.3) and Eq. (3.5) show that in the warped domain $\eta$, the speakers are shifted
versions of each other, the shift factor being speaker specific. Since the magnitude
of the Fourier transform is shift-invariant, the features derived are shift-invariant in
the warped domain, thus resulting in speaker normalization. It is interesting to
note that Eq. (3.6) is functionally similar to the mel-warp function, given
by O'Shaughnessy's formula [17] as

$$W(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{3.7}$$

The closeness of Eq. (3.6) to Eq. (3.7) is to be verified by fixing b and c, thus
making them speaker-independent. Also, the model with 3 degrees of freedom is
very difficult to implement on the recognizer, as the parameters would have to be estimated
on a 3-D mesh, which is very cumbersome. These were the two problems that
motivated, or rather restricted, us to reduce the dimensionality of Eq. (3.1) to one
by fixing b and c. Histograms were used to study the distributions of b and c.
Finally, the values of b and c were fixed to the mean values of their respective distributions.

Table 3.1 shows the estimates of b and c for PnB and HiL databases. The
standard deviations, $\sigma_b$ and $\sigma_c$, of the parameters b and c respectively, shown in
Table 3.1, support our assumption that b and c are speaker-independent. Hence
the normalization scheme involves using Eq. (3.1), where b and c are chosen from
Table 3.1 for the appropriate database. This can be implemented easily on a recognizer,
as the parameter search will be over a 1-D mesh. The speaker-dependent $\alpha_{\mathcal{RS}}$ can be
computed from a least squares fit between the formant frequencies of the reference and
the subject speaker. This step can be considered to be equivalent to the estimation
of k, as discussed in Chapter 2. Figure 3.1 shows the histograms of $\alpha_{\mathcal{RS}}$ for PnB and
HiL databases for male, female and child speakers. It is interesting to note that $\alpha_{\mathcal{RS}}$
and $\alpha$ (in Eq. (2.1)) are inversely related to each other. With an average female as
the reference, a male subject will have $\alpha < 1$ and a child subject will have $\alpha > 1$.
Figure 3.1 shows that male subjects have $\alpha_{\mathcal{RS}} > \rho$ and child speakers have $\alpha_{\mathcal{RS}} < \rho$
with respect to an average female speaker as reference, where $\rho$ is some threshold
dependent on the database. The trend in the estimates of $\alpha_{\mathcal{RS}}$ across the genders
shows the existence of gender separability. Though there exist a few outliers resulting
in crossover of the clusters, the number of such speakers is very small compared to
the size of the database. Thus, the model in Eq. (3.1) is valid in the sense of gender
separation. The warping function is given as

$$W(f) = \begin{cases} 0.9756 \log\left(1 + \dfrac{f}{0.7710}\right) & \text{for PnB database,} \\[2ex] 0.9761 \log\left(1 + \dfrac{f}{0.7369}\right) & \text{for HiL database.} \end{cases} \tag{3.8}$$
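
As an illustration of the one-parameter scheme, the following sketch estimates the speaker-dependent parameter by least squares when b and c are held at the PnB values of Table 3.1. It assumes the model form $F_{\mathcal{R}} = \alpha_{\mathcal{RS}}(1 + F_{\mathcal{S}}/b)^{c}$; with b and c fixed the model is linear in $\alpha_{\mathcal{RS}}$, so the estimate has a closed form. The formant values and the function names are hypothetical and are not taken from either database.

```python
import math

# Values of b and c for the PnB database, taken from Table 3.1.
B_PNB, C_PNB = 0.7710, 0.9756

def warp(f, b=B_PNB, c=C_PNB):
    """Warping function of Eq. (3.8): W(f) = c * log(1 + f / b)."""
    return c * math.log(1.0 + f / b)

def fit_alpha(ref_formants, subj_formants, b=B_PNB, c=C_PNB):
    """Closed-form least-squares alpha for F_R = alpha * (1 + F_S / b) ** c."""
    g = [(1.0 + fs / b) ** c for fs in subj_formants]
    num = sum(fr * gi for fr, gi in zip(ref_formants, g))
    den = sum(gi * gi for gi in g)
    return num / den

# Hypothetical formant values in Hz, for illustration only (not from PnB or HiL).
subject = [270.0, 2290.0, 3010.0, 730.0, 1090.0, 2440.0]
true_alpha = 1.5
reference = [true_alpha * (1.0 + fs / B_PNB) ** C_PNB for fs in subject]

alpha = fit_alpha(reference, subject)

# In the warped domain the log-formants of the reference differ from the
# warped formants of the subject by the constant shift log(alpha).
shifts = [math.log(fr) - warp(fs) for fr, fs in zip(reference, subject)]
```

With noise-free synthetic data the fit recovers the scale exactly, and every entry of `shifts` equals `log(alpha)`, which is the speaker-specific shift in the warped domain.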


Having answered our first question, let us try to answer the second one. Is the model
in Eq. (3.1) valid? If so, how good is it?


3.3 Model Validity 

To answer the second question, we made a comprehensive study of the relation
between speakers by searching for models other than Eq. (3.1) that normalize
the speakers. The analysis was carried out on the formant data of speech collected
from 14 average speakers for both PnB and HiL databases. 5 average male, 5 average
female and 4 average child speakers were obtained for both the databases. Hence the
size of the dataset that was considered is reduced to 14 speakers. In this dataset, for
a given reference speaker, a subject speaker can be chosen in $\binom{14}{1}$ ways† (with
repetitions allowed). This results in $\binom{14}{1} \times \binom{14}{1} = 196$ different combinations of the
subject and reference speakers. The TableCurve2D curve-fitting package was used to fit
simple models to each of the 196 combinations. This was mainly carried out to search for
the best simple-fit relationships satisfying all the combinations. The best 20 simple
models with least fitting errors were considered for each of the 196 combinations. A
subjective measure was developed to find the better models out of this whole lot (a
total of 196 × 20 = 3920, including the repetition of models across the combinations).

† $\binom{n}{k}$ = binomial coefficient

Figure 3.1: Histogram of the speaker dependent parameter, $\alpha_{\mathcal{RS}}$, in model based
normalization.

Figure shows the histogram of $\alpha_{\mathcal{RS}}$ (Eq. (3.1)), plotted for all the speakers in (a)
PnB database and (b) HiL database. The histograms clearly show 3 clusters depicting
the gender separability. Boy and girl speakers of the HiL database were considered
jointly as child speakers.


The following procedure was developed to find the best models that satisfy all the
combinations.

1. Rank the model-list obtained for each combination (the model-list will be
arranged in descending order of accuracy of fit, i.e. ascending order of fitting
error).

2. Define $N = [N_1\ N_2\ \cdots\ N_L]$. $N_j$ (the $j$-th element of N) represents the number of
occurrences of the $j$-th model (in the list of 3920 models). L represents the number
of distinct models.

3. Define a measure,

$$Q_j = \frac{1}{N_j} \sum_{i=1}^{N_j} \frac{R_{ij}\,(100 - A_{ij})}{100}, \quad 1 \le j \le L \tag{3.9}$$

$R_{ij}$ and $A_{ij}$ represent the rank and the accuracy (in %) of the $j$-th model in the $i$-th
combination. $Q_j$ is the $j$-th element of the vector Q, where $Q = [Q_1\ Q_2\ \cdots\ Q_L]$.

4. Calculate

$$t = \arg\min_{j} Q_j \tag{3.10}$$

5. The $t$-th model is the model (out of L models) that best satisfies/fits all the 196
combinations.
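
The selection procedure above can be sketched in a few lines. The sketch assumes that Eq. (3.9) averages the product of rank and percentage fitting error over the combinations in which a model occurs; the model names and accuracies are made up for illustration, and `rank_models` is a hypothetical helper, not a TableCurve2D function.

```python
from collections import defaultdict

def rank_models(combinations):
    """combinations: one model-list per subject/reference pair; each model-list
    is a list of (model_name, accuracy_percent) pairs."""
    scores = defaultdict(list)
    for model_list in combinations:
        # Step 1: rank by descending accuracy (rank 1 = best fit).
        ordered = sorted(model_list, key=lambda item: item[1], reverse=True)
        for rank, (name, accuracy) in enumerate(ordered, start=1):
            # Summand of Eq. (3.9): R_ij * (100 - A_ij) / 100.
            scores[name].append(rank * (100.0 - accuracy) / 100.0)
    # Q_j: average of the summands over the N_j occurrences of model j.
    q = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    best = min(q, key=q.get)  # Eq. (3.10)
    return best, q

# Made-up accuracies for three hypothetical models and two combinations.
combos = [
    [("power", 99.0), ("linear", 97.0), ("exp", 60.0)],
    [("power", 98.5), ("linear", 98.0), ("exp", 55.0)],
]
best, q = rank_models(combos)
```

A model that is ranked first with a small fitting error in every combination obtains the smallest measure, so `best` is `"power"` for this toy input.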


It is evident from Eq. (3.9) that $(100 - A_{ij})$ is the percentage fitting error for the $j$-th
model of the $i$-th combination. Hence, the product $R_{ij}(100 - A_{ij})$ should be small for
the models that fit the data better. Table 3.2 shows that Eq. (3.1) is the model
that best fits the data for both PnB and HiL databases. This indeed answers our
second question. But one should be aware of the catch here: higher-order
models fit the data better than lower-order models. Since we were interested in
one-parameter models, we studied the validity of one-parameter models. The
multi-parameter models in Table 3.2 were reduced to one-parameter models by fixing
the parameters with less variance. Only the 10 best models in Table 3.2 were
considered while reducing the multi-parameter models to one-parameter models.
Table 3.3 shows the one-parameter models for PnB and HiL databases along with the
fitting errors calculated using Eq. (3.11). Suppose $F_{\mathcal{R}} = h(\alpha_{\mathcal{RS}}, F_{\mathcal{S}})$ is the model;
the curve-fitting error was estimated as

$$e = \sum_{i=1}^{M} \cdots \tag{3.11}$$
where M is the number of speakers in the database. Table 3.3 substantiates the
validity of Eq. (3.1).

Rank   PnB                                  HiL

1      $y = a(1 + x/b)^{c}$                 $y = a(1 + x/b)^{c}$
2      $y = a + bx^{c}$                     $y = a + bx^{c}$
3      $y = ax^{b}$                         $y = ax^{b}$
4      $y = a + be^{-x}$                    $y^{0.5} = a + bx^{0.5}$
5      $y^{0.5} = a + bx^{0.5}$             $\log(y) = a + b\log(x)$
6      $y = a + bx$                         $y = a + bx$
7      $\log(y) = a + b\log(x)$             [illegible]
8      $y^{-1} = a + bx^{-1}$               $y^{-1} = a + bx^{-1}$
9      $y^{2} = a + bx^{2}$                 $y^{2} = a + bx^{2}$
10     [illegible]                          [illegible]
11     $y^{0.5} = a + bx^{0.5}\log(x)$      $y^{0.5} = a + bx^{0.5}\log(x)$
12     $y^{2} = a + bx^{2}\log(x)$          $y = a + bx\log(x)$
13     $y = a + bx\log(x)$                  $y^{2} = a + bx^{2}\log(x)$
14     $\log(y) = a + b(\log(x))^{2}$       $\log(y) = a + b(\log(x))^{2}$
15     $y^{-1} = a + bx^{-1}\log(x)$        $y^{-1} = a + bx^{-1}\log(x)$


Table 3.2: Best simple curve fits for vowel data.

The equations with smaller rank are the models that best fit the data. It is interesting
to note the similarity in the models for PnB and HiL databases. Eq. (3.1) is the
model that fits the data best for both PnB and HiL databases. This supports the
choice of Eq. (3.1).

Rank   PnB                                         HiL
       Model                           e           Model                               e

1      one-parameter form of Eq. (3.1) 1.98E4      one-parameter form of Eq. (3.1)     1.40E4
2      $y = a + 2.64\,x^{0.93}$        3.68E4      $y = a + 1.92\,x^{[illegible]}$     2.28E4
3      $y = ax^{0.98}$                 1.98E4      $y = ax^{0.98}$                     1.40E4
4      $y = a + e^{-x}$                9.92E5      $y = a + e^{-x}$                    9.86E5
5      $y^{0.5} = a +$ [illegible]     2.40E4      $y^{0.5} = a +$ [illegible]         1.55E4
6      $y = a + 1.04x$                 4.02E4      $y = a + 1.02x$                     2.38E4
7      $\log(y) = a + 0.98\log(x)$     1.98E4      $\log(y) = a + 0.98\log(x)$         1.40E4
8      $y^{-1} = bx^{-1}$              2.02E4      [illegible]                         1.43E4
9      $y^{2} = a + 1.11x^{2}$         9.18E4      $y^{2} = a + 1.07x^{2}$             6.31E4
10     [illegible]                     3.52E4      [illegible]                         2.17E4


Table 3.3: Best one-parameter models for vowel data.

Here, the rank does not signify the quality of fit. The one-parameter version of
Eq. (3.1) is, along with a few other models, the model that fits the data with least
error for both PnB and HiL databases. This supports the choice of Eq. (3.1).


3.4 Comparison of Model Based Frequency Warping Function and Mel-Warp Function

As mentioned earlier, one of the motivating factors in choosing the model in Eq. (3.1)
is the functional similarity of Eq. (3.6) with Eq. (3.7). We made a comprehensive
study to verify the existence of a mel-like warping function obtained only from speech
data. Since the model in Eq. (3.1) is non-linear, the error performance surface may
have many minima. Let us define $T = (\alpha_{\mathcal{RS}}, b, c)$. The estimation of the parameters
defined by T requires an initial estimate of the parameters along with the search
range. The values of b and c in Table 3.1 were obtained by choosing the initial
estimates to be 1 and the range of search to be $\Re$, i.e. $(-\infty, \infty)$. The solution for T
obtained with these initial estimates may not be the global minimum. It is important
to note that b is the parameter that actually determines the shape of the warping
function; c is just a scaling factor that is hardly of any importance. We developed
the following procedure to find the initial estimate of b (over the region of interest)
that gives the minimum error according to Eq. (3.11). We mainly carried out this
procedure to check whether an initial estimate of b around 500 to 1000 would
give a smaller residue than the other initial estimates of b.

1. Define $B = [b_1\ b_2\ \cdots\ b_I]$, where $b_i$ is the $i$-th initial estimate of b and I is the
number of different initial estimates. Choose the initial estimates of $\alpha_{\mathcal{RS}}$ and c
to be 1.

2. Fit the model defined in Eq. (3.1) for a given speaker with the different initial
estimates $T_{initial} = (1, b_i, 1)$ and $c \in \Re$, $1 \le i \le I$.

Figure 3.2: Histogram of the initial estimates of b for model based normalization.
Figure shows the histogram of the initial estimates of b (Eq. (3.1)) for (a) PnB
database and (b) HiL database. In our experiments, B = [1 10 50 100 500 700 1000].
It is clear that $b_{initial} = 1$ gives the minimum e (Eq. (3.12)) for both PnB and HiL
databases.


3. Calculate the residue as

$$e_i = \left\| F_{\mathcal{R}} - \alpha_{\mathcal{RS}} \left(1 + \frac{F_{\mathcal{S}}}{b}\right)^{c} \right\|^2 \tag{3.12}$$

where $\alpha_{\mathcal{RS}}$, b and c are the values fitted from the $i$-th initial estimate.

4. Determine the initial estimate of b that minimizes the residue defined in
Eq. (3.12) as the $\ell$-th element of B, where

$$\ell = \arg\min_{1 \le i \le I} e_i \tag{3.13}$$

and $e = [e_1\ e_2\ \cdots\ e_I]$.


5. Repeat steps 2 to 4 for all speakers in the database.
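
Steps 4 and 5 amount to an arg-min per speaker followed by a count over speakers, which is what the histogram of initial estimates displays. The sketch below assumes that the residues of step 3 have already been obtained from some non-linear least squares routine (not shown); the residue values are made up, and only the list B is taken from the experiments.

```python
from collections import Counter

# Initial estimates of b used in the experiments (see Figure 3.2).
B = [1, 10, 50, 100, 500, 700, 1000]

def best_initial_estimate(residues):
    """Eq. (3.13): element of B whose fit gave the smallest residue e_i."""
    idx = min(range(len(residues)), key=lambda i: residues[i])
    return B[idx]

def histogram_of_winners(residues_per_speaker):
    """Step 5: repeat the selection for every speaker and count the winners."""
    return Counter(best_initial_estimate(e) for e in residues_per_speaker)

# Made-up residues for three hypothetical speakers (one row per speaker,
# one column per initial estimate in B).
residues_per_speaker = [
    [2.0, 2.1, 2.4, 2.5, 3.0, 3.1, 3.3],
    [1.8, 1.7, 1.9, 2.2, 2.9, 3.0, 3.2],
    [2.2, 2.3, 2.3, 2.6, 3.1, 3.1, 3.4],
]
counts = histogram_of_winners(residues_per_speaker)
```

For this toy input the initial estimate 1 wins for two speakers and 10 for one, so `counts` plays the role of one histogram panel.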

Figure 3.3: Comparison of warping function, W(f), derived from model based
normalization with log-warp and mel-warp functions.

Figure shows the warping functions derived using the MBN method (Eq. (3.8)) for PnB
and HiL databases, along with log-warp and mel-warp functions. Here, the log-warp
function corresponds to the model defined in Eq. (3.14).


Figure 3.2 shows the histograms of the initial estimates of b for PnB and HiL
databases. It is clear from Figure 3.2 that $b_{initial} = 1$ has the maximum number
of speakers for whom e (Eq. (3.12)) is minimum compared to other initial estimates
of b. This analysis thus supports the warping functions defined in Eq. (3.8), which
were derived with $T_{initial} = (1, 1, 1)$. Hence the warping functions derived from
Eq. (3.1) are indeed reliable. Figure 3.3 shows W(f) for PnB and HiL databases
along with log-warp and mel-warp functions. It shows that W(f) is much closer to
the log-warp function than to the mel-warp function. Hence our motivation to verify the
similarity between the speech-derived warping function and the mel-warp function
resulted in a more log-like function, which in turn corresponds to uniform scaling between
the speakers. Now, here arises a contradiction. In Chapter 2, we showed that the
warping function derived from the speech data is a compromise between the log-warp
and mel-warp functions. But this analysis of speaker relationships has resulted in a
log-like function. How does this difference come about?

3.5 Comparison of Model Based Frequency Warping Function and Log-Warp Function

In order to convince ourselves of the result explained in the previous section, we
experimented with a model different from Eq. (3.1), given by

$$F_{\mathcal{R}} = \alpha_{\mathcal{RS}} F_{\mathcal{S}} \tag{3.14}$$

which is uniform scaling of the formants of two speakers, $\mathcal{R}$ and $\mathcal{S}$, the scaling
factor being only speaker dependent and independent of frequency. Applying
logarithm to both sides of Eq. (3.14), we have

$$\log(F_{\mathcal{R}}) = \log(\alpha_{\mathcal{RS}}) + \log(F_{\mathcal{S}}) \tag{3.15}$$

Hence the warping function is given by

$$W(f) = \log(f) \tag{3.16}$$

which is the log-warp function.
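
The relative closeness of the three warping functions can be checked numerically. The sketch below is an illustration under stated assumptions: it evaluates Eq. (3.7), Eq. (3.16) and Eq. (3.8) (with the PnB values b = 0.7710 and c = 0.9756 from Table 3.1) on a frequency grid from 100 Hz to 4000 Hz, a range chosen here only for the example, and rescales each curve to [0, 1] so that only the shapes are compared.

```python
import math

def log_warp(f):
    """Log-warp function, Eq. (3.16)."""
    return math.log(f)

def mel_warp(f):
    """Mel-warp function, Eq. (3.7)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def model_warp(f, b=0.7710, c=0.9756):
    """Model based warping function, Eq. (3.8), with the PnB values of Table 3.1."""
    return c * math.log(1.0 + f / b)

def normalised(warp, freqs):
    """Rescale a warping curve to [0, 1] so that only its shape is compared."""
    lo, hi = warp(freqs[0]), warp(freqs[-1])
    return [(warp(f) - lo) / (hi - lo) for f in freqs]

def max_gap(curve_a, curve_b):
    """Largest pointwise distance between two sampled curves."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))

freqs = [100.0 + 50.0 * k for k in range(79)]  # 100 Hz to 4000 Hz (assumed range)
model = normalised(model_warp, freqs)
gap_to_log = max_gap(model, normalised(log_warp, freqs))
gap_to_mel = max_gap(model, normalised(mel_warp, freqs))
```

Because b is small compared to the formant frequencies, 1 + f/b is dominated by f/b and the model curve nearly coincides with the log-warp curve, while its distance to the mel-warp curve is far larger.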

A comparative study of the performance of various normalization procedures
(discussed in Chapter 2) was carried out in a qualitative and quantitative
sense to study the effectiveness of normalization. The results of these experiments,
explained in Chapter 4 and Chapter 6, show that the model in Eq. (3.14) does
better normalization than the simple linear scaling of Nordstrom & Lindblom, Fant’s
non-uniform normalization and frequency dependent scaling methods. This shows
that uniform scaling is better than other normalization methods. This is very
interesting. It is important to note that the warping function derived from the FDS method
is actually a discrete one. The discretization of the warping function has resulted
in fine modelling of the non-linearity of the scaling factor in each frequency
region, thus deviating from uniform scaling. But the warping function derived from
Eq. (3.1) is a continuous function of frequency, the parameters being obtained by
fitting Eq. (3.1) to the formant data of subject and reference speakers. Hence, on
an average, this produces a uniform scale-like relationship. The performance
difference (discussed in Chapter 4) between the linear scaling method of Nordstrom
& Lindblom and the uniform scaling obtained by MBN is quite interesting, the latter
outperforming the former. The only difference between these two methods lies in the
parameters that are involved in the computation of the scale factors. In the Nordstrom &
Lindblom method, only the average $F_3$ in open vowels was considered in computing the
scale factor, whereas in the MBN method, all the formants $(F_1, F_2, F_3)$ were involved in
the computation of $\alpha_{\mathcal{RS}}$. On the whole, MBN provides a gross linear-scale relation
among the speakers.

3.6 Summary 

A model based vowel normalization procedure, motivated by the idea of studying the
relationship between the formant frequencies of speakers, was presented. The warping
function derived from the assumed model was compared with log-warp and
mel-warp functions. Model based normalization has resulted in a linear scale
relationship among the speakers, which is a gross approximation to the non-linearity
present in their actual relationship.



Chapter 4 


Comparison of Vowel 
Normalization Methods Using 
Separability Measures 


In the previous chapters, we discussed different methods of vowel normalization,
each of them aimed at reducing the speaker dependence. The warping functions were
derived with an aim to apply these normalization methods to continuous spectral
patterns. The criterion for the degree of success of these normalization procedures
might be that they should maximally reduce the variance within each group of vowels
when spoken by different speakers, while maintaining the separation between such
groups. In this chapter, we compare these different vowel normalization methods by
defining various measures, both in a qualitative and a quantitative sense.


4.1 Residual Variance 

One of the measures [5] used by Fant to find the efficacy of the non-uniform
normalization scheme is the percentage of variance remaining after non-uniform
normalization when compared to the uniform normalization scheme of Nordstrom &
Lindblom [1]. The variance in each of the three formants, $F_1$, $F_2$, $F_3$, after
normalization is given by

$$V_n = \sum_{\text{subject}} \sum_{\text{vowel}} \left[ k_{n,\text{observed}} - k_{n,\text{predicted}} \right]^2, \quad n = 1, 2, 3 \tag{4.1}$$

where $k_{n,\text{observed}}$ is calculated using the actual value of the $n$-th formant of each
vowel of the subject and the reference speaker (i.e. average female), and $k_{n,\text{predicted}}$ is
the predicted value of the scale factor for the $n$-th formant of each vowel of the
subject using a given vowel normalization scheme. The percentage residual variance
after non-uniform normalization compared to the uniform normalization of Nordstrom
& Lindblom for the $n$-th formant is defined as

$$R_n = \frac{V_{n,\text{non-uniform}}}{V_{n,\text{uniform}}} \times 100 \tag{4.2}$$

In our experiments with vowel normalization methods, we computed $R_n$ for Fant’s
normalization method, the frequency dependent scaling method and the model based
normalization method. Table 4.1 shows the performance of different normalization
schemes against the uniform normalization method. It is clear from Table 4.1 that
the performance of the FDS and MBN methods is comparable to Fant’s method even
though, unlike Fant’s method, they assume no a priori information about the vowel
category and formant number. Further, for HiL data, it can be seen that MBN
outperforms Fant’s method, especially for children.
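
The residual variance measure can be sketched as follows, assuming that the variance of Eq. (4.1) is the squared difference between observed and predicted scale factors summed over all subject and vowel pairs, and that Eq. (4.2) expresses the non-uniform variance as a percentage of the uniform one. The scale factors below are made up for illustration.

```python
def variance(observed, predicted):
    """Eq. (4.1): squared error between observed and predicted scale factors,
    summed over all (subject, vowel) pairs for one formant."""
    return sum((ko - kp) ** 2 for ko, kp in zip(observed, predicted))

def residual_variance(observed, predicted_nonuniform, predicted_uniform):
    """Eq. (4.2): variance left by a non-uniform scheme, as a percentage of
    the variance left by uniform normalization."""
    v_non = variance(observed, predicted_nonuniform)
    v_uni = variance(observed, predicted_uniform)
    return 100.0 * v_non / v_uni

# Made-up scale factors for one formant over four (subject, vowel) pairs.
k_observed = [1.10, 1.20, 1.05, 1.15]
k_uniform = [1.125, 1.125, 1.125, 1.125]   # a single factor for the subject
k_nonuniform = [1.11, 1.18, 1.07, 1.14]    # vowel-dependent prediction

r_n = residual_variance(k_observed, k_nonuniform, k_uniform)
```

A value below 100 means the non-uniform scheme leaves less variance than uniform normalization; for this toy input `r_n` is about 8.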


Residual                   PnB                           HiL
Variance (%)      Ad. & Ch.    Ad.    Ch.       Ad. & Ch.    Ad.    Ch.

Fant      R1          90        80    108          103        75    151
          R2          80        78     84           89        78     97
          R3          93        92     96           78        83     74

FDS       R1          88        86     91          101        84    130
          R2          78        82     72           81        81     81
          R3         100        97    106           82        86     79

MBN-1     R1          93        96     85           80        77     84
          R2          72        79     62           79        74     83
          R3          84        84     84           73        78     69

MBN-2     R1          88        90     85           87        80     98
          R2          79        79     65           78        76     80
          R3          88        88     89           76        81     71


Table 4.1: Residual variance after normalization.

Percentage variance remaining after different non-uniform normalization methods
when compared to the uniform normalization of Nordstrom & Lindblom, for the three
formants. Here Ad. stands for adult speakers and Ch. stands for child speakers.
The average female of the respective database was considered as the reference speaker
in computing residual variance. MBN-1 refers to the model $F_{\mathcal{R}} = \alpha_{\mathcal{RS}}(1 + F_{\mathcal{S}}/b)^{c}$
and MBN-2 refers to the model $F_{\mathcal{R}} = \alpha_{\mathcal{RS}} F_{\mathcal{S}}$ defined in Chapter 3.


4.2 F-Ratio

Since discriminability between vowel clusters is as important as reduction of variance
within any given vowel cluster, a good measure for the usefulness of the normalization
schemes would be the F-ratio [10, 21]. In discriminant analysis, within-class and
between-class scatter matrices are used to formulate criteria of class separability. In
deriving the F-ratio, one of the separability measures, let $M_i$ and $C_i$ denote the mean
formant $(F_1, F_2, F_3)$ vector and its covariance matrix respectively, of the $i$-th vowel
class. An equal probability of vowel classes is assumed. Let $M_0 = \frac{1}{I} \sum_{i=1}^{I} M_i$, where
I denotes the number of vowel classes being compared. Then the within-class, $S_w$,
and between-class, $S_b$, scatter matrices are computed by

$$S_w = \frac{1}{I} \sum_{i=1}^{I} C_i \tag{4.3}$$

$$S_b = \frac{1}{I} \sum_{i=1}^{I} (M_i - M_0)(M_i - M_0)^T \tag{4.4}$$

where T represents matrix transposition. The separability criterion is then given by

$$J = \mathrm{trace}\left\{ (S_b + S_w)^{-1} S_b \right\} \tag{4.5}$$

The vowel cluster discriminability in terms of the F-ratio, J, for the unnormalized
(unwarped), uniform normalization, Fant’s non-uniform normalization, FDS and MBN
methods is shown in Table 4.2. It is clear from Table 4.2 that MBN does the best
normalization followed by the FDS method, both of them being context and formant
independent unlike Fant’s method. In Eq. (4.5), as separability improves, J should
approach the ideal value of 3.

F-Ratio (J)                        PnB                           HiL
                          Ad. & Ch.    Ad.    Ch.       Ad. & Ch.    Ad.    Ch.

Un-normalized               2.01      2.21   2.31         2.13      2.28   2.31
Nordstrom & Lindblom        2.42      2.45   2.43         2.47      2.56   2.37
Fant’s non-uniform          2.49      2.52   2.41         2.52      2.63   2.40
FDS                         2.47      2.50   2.47         2.53      2.61   2.44
MBN-1                       2.49      2.51   2.50         2.56      2.62   2.46
MBN-2                       2.49      2.50   2.50         2.62      2.50   2.46


Table 4.2: Vowel cluster discriminability in terms of F-Ratio.

Performance of various vowel normalization schemes based on the F-ratio measure,
applied on PnB and HiL databases. Ad., Ch., MBN-1, MBN-2 are the same as
explained in Table 4.1.
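
The F-ratio computation can be sketched in plain Python as below. The sketch assumes the criterion $J = \mathrm{trace}\{(S_b + S_w)^{-1} S_b\}$ with equiprobable classes; the small Gauss-Jordan solver is only a convenience to keep the example self-contained, and the vowel tokens are hypothetical points placed around made-up $(F_1, F_2, F_3)$ centres.

```python
def mean_vector(samples):
    n = len(samples)
    return [sum(s[d] for s in samples) / n for d in range(len(samples[0]))]

def scatter(samples, centre):
    """(1/n) * sum of (x - centre)(x - centre)^T over the samples."""
    n, dim = len(samples), len(centre)
    return [[sum((s[r] - centre[r]) * (s[k] - centre[k]) for s in samples) / n
             for k in range(dim)] for r in range(dim)]

def solve(a, b):
    """Solve a X = b for square matrices by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(a[r]) + list(b[r]) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def f_ratio(classes):
    """J of Eq. (4.5) for a list of classes, each a list of formant vectors."""
    dim, num = len(classes[0][0]), len(classes)
    means = [mean_vector(c) for c in classes]
    covs = [scatter(c, m) for c, m in zip(classes, means)]
    s_w = [[sum(cov[r][k] for cov in covs) / num for k in range(dim)]
           for r in range(dim)]                       # Eq. (4.3)
    s_b = scatter(means, mean_vector(means))          # Eq. (4.4)
    total = [[s_b[r][k] + s_w[r][k] for k in range(dim)] for r in range(dim)]
    x = solve(total, s_b)                             # (S_b + S_w)^(-1) S_b
    return sum(x[d][d] for d in range(dim))

def make_class(centre, spread):
    """Six hypothetical tokens at centre +/- spread along each formant axis."""
    points = []
    for d in range(3):
        for sign in (-1.0, 1.0):
            p = list(centre)
            p[d] += sign * spread
            points.append(p)
    return points

centres = [(300.0, 2300.0, 3000.0), (700.0, 1100.0, 2500.0),
           (300.0, 900.0, 2200.0), (500.0, 1500.0, 2500.0)]
j_tight = f_ratio([make_class(c, 10.0) for c in centres])
j_loose = f_ratio([make_class(c, 300.0) for c in centres])
```

Tight clusters give a value of J just below the ideal value of 3, while widely spread clusters around the same centres give a clearly smaller J.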


4.3 Scatter Plots 

The $F_1$ - $F_2$ scatter plots for PnB and HiL databases are shown in Figure 4.1 and
Figure 4.2. The larger the separation between the clusters and the smaller the spread of
each cluster, the better the normalization. The scatter plots in Figure 4.1 and Figure 4.2
provide a visual measure showing that the MBN and FDS methods provide better
normalization than the Fant and Nordstrom & Lindblom methods.


4.4 Summary 

Different vowel normalization methods were compared in both a qualitative and a
quantitative sense. The efficacy of vowel normalization methods has to be judged
from the size of the vowel clusters and the separation between the clusters after
normalization. Quantitative measures like residual variance and F-ratio, along
with qualitative measures like scatter plots, were used to compare different
normalization schemes. Model based normalization and frequency dependent scaling
methods perform better than Fant and uniform scaling methods with respect to all
the aforesaid measures.

Figure 4.1: Scatter plots of $F_1$ - $F_2$ for 10 vowels from Peterson & Barney database.
Figure shows the scatter plots of $F_1$ - $F_2$ for 10 vowels from the Peterson & Barney
database with and without normalization. MBN-1, MBN-2 are the same as
explained in Table 4.1. As seen in the figure, MBN-1 and MBN-2, followed by the FDS
method, provide good separability among vowel clusters.

Figure 4.2: Scatter plots of $F_1$ - $F_2$ for 12 vowels from Hillenbrand et al. database.
Figure shows the scatter plots of $F_1$ - $F_2$ for 12 vowels from the Hillenbrand et al.
database with and without normalization. MBN-1, MBN-2 are the same as
explained in Table 4.1. As seen in the figure, MBN-1 and MBN-2, followed by the FDS
method, provide good separability among vowel clusters.







Chapter 5 

Estimation of Frequency-Warping
Function Using Vowel Data


It is well established that differences in vocal tract length between speakers are a
major source of variability in speech. This has led to the “scale
relationship” between the speakers. The motivation of speaker normalization has
resulted in the application of scale invariant transforms (viz. Scale Transform [10, 22],
Fourier-Mellin Transform [23]) in deriving the speech features. The basic idea is
to warp a pair of mutually scaled spectra such that in the warped domain they are
shifted versions of one another. By taking the magnitude of the Fourier transform of
these shifted versions, we get identical spectra in the warped domain.


5.1 Scale Transform 


Briefly, the scale transform of a function, $X(f)$, is given by

$$D_X(c) = \int_{0}^{\infty} X(f) \, \frac{e^{-j2\pi c \ln f}}{\sqrt{f}} \, df \tag{5.1}$$

and inversely,

$$X(f) = \int_{-\infty}^{\infty} D_X(c) \, \frac{e^{j2\pi c \ln f}}{\sqrt{f}} \, dc \quad \forall \, f > 0 \tag{5.2}$$

A basic property of the scale transform is that the magnitude of the scale transform
of a function, $X(f)$, and its normalized scaled version, $\sqrt{\alpha} X(\alpha f)$, are equal (note that
$0 < \alpha < 1$ corresponds to dilation, while $1 < \alpha < \infty$ corresponds to compression),
since

$$D_X^{\alpha}(c) = \int_{0}^{\infty} \sqrt{\alpha} X(\alpha f) \, \frac{e^{-j2\pi c \ln f}}{\sqrt{f}} \, df \tag{5.3}$$

$$= e^{j2\pi c \ln \alpha} \, D_X(c) \tag{5.4}$$

Eq. (5.4) shows that the scale transform of $\sqrt{\alpha} X(\alpha f)$ is the same as that of $X(f)$ except
for a linear phase, which disappears by taking magnitude on both sides of Eq. (5.4),
i.e.,

$$|D_X^{\alpha}(c)| = |D_X(c)| \tag{5.5}$$

Thus, considering two speakers who are scaled versions of one another, $\alpha$ being
their characteristic scale factor, Eq. (5.5) shows that in the scale transform domain,
both the speakers look alike, as the speaker dependent term that appears in phase
is nullified by taking magnitude. Thus, there is no need to explicitly calculate the
speaker specific scaling constant. The scale transform may also be calculated as the
Fourier transform of the function $X(e^{f}) e^{f/2}$, i.e.,

$$D_X(c) = \int_{-\infty}^{\infty} X(e^{f}) \, e^{f/2} \, e^{-j2\pi c f} \, df \tag{5.6}$$

It is to be noted that as a result of log-warping, i.e., forming $X(e^{f})$, the speaker
specific scale constant, $\alpha$, is purely a function of the translation parameter in the
log-warped domain.
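
The magnitude invariance of the scale transform can be checked numerically. The following sketch approximates the scale transform in its log-warped Fourier form by a Riemann sum over $v = \ln f$; the Gaussian-shaped envelope, the scale factor 1.2, the integration limits and the step size are all assumptions made only for this illustration.

```python
import cmath
import math

def envelope(f, centre=1000.0, width=0.3):
    """Hypothetical spectral envelope: a Gaussian bump on a log-frequency axis."""
    return math.exp(-((math.log(f) - math.log(centre)) ** 2) / (2.0 * width ** 2))

def scale_transform(x, c, lo=2.0, hi=12.0, step=0.005):
    """Riemann-sum approximation of the scale transform, computed as the
    Fourier transform of x(e^v) * e^(v/2) with v = ln(f) running from lo to hi."""
    total = 0j
    n = int(round((hi - lo) / step))
    for k in range(n):
        v = lo + k * step
        total += x(math.exp(v)) * math.exp(v / 2.0) * cmath.exp(-2j * math.pi * c * v)
    return total * step

alpha = 1.2  # assumed scale factor between the two speakers

def scaled_envelope(f):
    """Normalized scaled version sqrt(alpha) * X(alpha * f)."""
    return math.sqrt(alpha) * envelope(alpha * f)

c = 0.4
d_x = scale_transform(envelope, c)
d_y = scale_transform(scaled_envelope, c)
phase = cmath.exp(2j * math.pi * c * math.log(alpha))
```

The two transforms agree in magnitude, and `d_y` equals `phase * d_x` up to numerical error, so the scale factor appears only as a linear phase term.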


5.2 Frequency Warping Function 

Consider two speakers A and B related by

$$S_A(f) = S_B(\alpha_{AB} f) \tag{5.7}$$

where $S(\cdot)$ denotes the spectral envelopes and $\alpha_{AB}$ is the scale factor of the subject
speaker, B, with respect to the reference speaker, A. This would be the case if
uniform scaling were true. Consider the warping function $f = e^{\nu}$, which is applied to
all speakers. This is log-warping of the frequency axis. In the log-warped domain,
the scaling factor appears as a translation factor.

$$s_a(\nu) = S_A(f = e^{\nu}) = S_B(\alpha_{AB} e^{\nu}) = S_B(e^{\nu + \ln \alpha_{AB}}) = s_b(\nu + \ln \alpha_{AB}) \tag{5.8}$$

Note the use of lower-case subscripts to denote the spectra or functions in the
warped domain or $\nu$-domain. Thus, in the log-warped domain, the warped spectra
are shifted versions of one another. The magnitude of their Fourier transform leads
to scale invariance.

$$|\mathcal{F}(s_a(\nu))| = |\mathcal{F}(s_b(\nu + \ln \alpha_{AB}))|$$

where $\mathcal{F}(\cdot)$ represents the Fourier transform operator. Here exponential sampling
denotes linear scaling of the frequency axis, which is realized as equal sampling in the
log-domain. Figure 2.1 shows that the scale factor, $\alpha_{AB}$, is indeed formant dependent
and context dependent, resulting in non-uniform scaling. In such a case, the relation
between the spectral envelopes of two speakers can be modelled as

$$S_A(f) = S_B(\alpha_{AB}(f) f) \tag{5.9}$$

where $\alpha_{AB}(f)$ is a frequency-dependent, non-uniform scaling factor. Analogous to
uniform scaling, our goal is to find a transformation, $f = z(\nu)$, such that

$$s_a(\nu) = S_A(f = z(\nu)) = S_B(\alpha_{AB}(f) f) = s_b(\nu + \varsigma_{AB}) \tag{5.10}$$

where $\varsigma_{AB}$ is dependent only on the speakers A and B, and is independent of
frequency. Now, our aim is to find the function $z(\cdot)$, which warps the spectra, thus
making them shifted versions in the warped domain. Since finding the exact form
of $z(\cdot)$ is difficult, we discretized the computation of the warping function [7, 24].


5.3 Numerical Computation of the Warping Function

5.3.1 Discrete-Implementation of Warping Function 

We now obtain a relationship between $f$ and $\nu$ at a discrete set of frequencies. Let us
divide the frequency axis into $N$ logarithmically equi-spaced regions. In each region,
let us assume that the spectral envelopes of any two speakers are scaled versions of
one another. So, for the $i^{th}$ frequency region, $f \in [L_i, U_i)$, we have

$$S_A(f) = S_B(\alpha_{AB}^{(i)} f), \quad L_i \le f < U_i \tag{5.11}$$

where $\alpha_{AB}^{(i)}$ is the scale factor for the $i^{th}$ frequency region, $L_i$ and $U_i$ are the lower and
upper frequency boundaries of the $i^{th}$ region respectively, and $1 \le i \le N$. Define

$$\alpha_{AB}^{(i)} = \alpha_{AB}^{\beta_i} \tag{5.12}$$

which assumes that the frequency dependency is present in the parameter $\beta_i$ and that
$\alpha_{AB}$ is only speaker dependent (independent of frequency). We need to compute
$S_B(\nu = \log(f))$ for $\nu \in [\log(L_i), \log(U_i))$. Let us discretize the computation of $S_B(\nu)$
at $M_i$ equally spaced intervals in the region $\log(L_i)$ to $\log(U_i)$. Let

$$\Delta\nu_i = \frac{\log(U_i) - \log(L_i)}{M_i} \tag{5.13}$$

Then the uniformly spaced samples in the $i^{th}$ frequency region in the $\nu$-domain are
$S_B(m_i \Delta\nu_i + \log(L_i))$ for $m_i = 0, 1, \cdots, (M_i - 1)$. Uniformly sampling $S_A(\nu)$ at $\Delta\nu_i$
spacing in the $i^{th}$ frequency region results in

$$S_A(m_i \Delta\nu_i + \log(L_i)) = S_B(m_i \Delta\nu_i + \log(\alpha_{AB}^{\beta_i}) + \log(L_i)) \tag{5.14}$$

Eq. (5.14) can be rewritten as

$$S_A(m_i \Delta\nu_i + \log(L_i)) = S_B\left(\left(m_i + \frac{\beta_i \log(\alpha_{AB})}{\Delta\nu_i}\right)\Delta\nu_i + \log(L_i)\right) \tag{5.15}$$

It can be seen that the two functions differ by a translation factor $\beta_i \log(\alpha_{AB}) / \Delta\nu_i$ in the
$i^{th}$ frequency region. Since we define the warped envelopes to be translated versions
of one another over the entire range of interest, we require the following condition
to be satisfied between any two frequency regions $i$ and $j$:

$$\frac{\beta_i \log(\alpha_{AB})}{\Delta\nu_i} = \frac{\beta_j \log(\alpha_{AB})}{\Delta\nu_j} = \frac{\log(\alpha_{AB})}{\Delta\lambda} \tag{5.16}$$

where $\lambda$ is a new domain in which the scaled spectra appear as shifted versions of one
another. From Eq. (5.13), we have

$$\Delta\nu_i M_i = \Delta\nu_j M_j \tag{5.17}$$

as $\log(U_i / L_i) = \log(U_j / L_j)$, thus resulting in $M_i / M_j = \beta_j / \beta_i$. We can therefore choose $M_i$
for the different frequency regions, i.e., the spacing of samples in the $\nu$-domain, such that
$M_i / M_j = \beta_j / \beta_i$. With the total number of samples held constant, the $M_i$'s are given by

$$\sum_{i=1}^{N} M_i = \text{const} \tag{5.18}$$

$$M_i = \frac{\text{const} / \beta_i}{\sum_{j=1}^{N} 1 / \beta_j} \tag{5.19}$$

With this choice of $M_i$'s, the non-uniformly spaced samples in the $\nu$-domain are
represented as uniformly spaced samples in the $\lambda$-domain. Since the scale is arbitrary in
the $\lambda$-domain, we can set the spacing of samples and the origin to some convenient
values. Eq. (5.19) shows that the calculation of the $M_i$'s depends on the $\beta_i$'s. So, we need to
devise a procedure to compute the $\beta_i$'s from the speech data, from which the warping
function can be derived easily. This is the situation that was explained in
Section 2.4.1. So, given the warping parameters (methods of computation of these
parameters may be different), we can numerically compute the discrete warping
function. Before examining the method of computation of the warping parameters, let
us consider the following situation. If there exists a simple linear scaling between
two speakers, say A and B, then $\alpha_{AB}(f) = \alpha_{AB}$, i.e., the scaling factor is only
speaker dependent and is independent of frequency. So, $\beta(f) = 1$ or, in its discrete
form, $\beta_i = 1$, $i = 1, 2, \cdots, N$. The warping function in such a case is given by

$$W(f) = \lambda = \nu = \log(f) \tag{5.20}$$

Because of the non-linear scaling between the speakers, the scale factor will be both
frequency dependent and speaker dependent. In such cases, $\beta(f)$ models the
non-linearity in the scale factor. In other words, the non-linearity in the $i^{th}$ frequency region
is modelled by $\beta_i$. Hence the discrete warping function is given by

$$W_i(f) = \lambda = \frac{\nu}{\beta_i} = \frac{\log(f)}{\beta_i} \tag{5.21}$$
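
The requirement that the warped-domain spacing be the same in every band can be checked numerically. The following is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `allocate_samples`, the total of 256 samples and the exactly log-spaced band edges produced by `np.geomspace` are illustrative choices of ours, while the $\beta_i$ values are the PnB estimates of Table 5.1. The $M_i$ are kept as real numbers; rounding them to integers would make the equality of spacings only approximate.

```python
import numpy as np

def allocate_samples(betas, total_samples):
    """Split total_samples across bands so that M_i is proportional to 1/beta_i."""
    inv = 1.0 / np.asarray(betas, dtype=float)
    return total_samples * inv / inv.sum()

# Assumed setup: five exactly log-spaced bands spanning the PnB range of Table 5.1.
edges = np.geomspace(190.0, 4381.0, 6)             # L_1, U_1 = L_2, ..., U_5
betas = np.array([2.13, 1.22, 1.51, 1.27, 1.00])   # PnB beta_i from Table 5.1
m = allocate_samples(betas, total_samples=256)     # real-valued M_i

# Sample spacing in the nu = log(f) domain for each band.
delta_nu = np.log(edges[1:] / edges[:-1]) / m

# Spacing in the warped domain: delta_nu_i / beta_i should not depend on i.
delta_lambda = delta_nu / betas
print(np.round(m, 2))
print(delta_lambda)
```

Bands with a larger $\beta_i$ receive fewer samples, so their spacing in the $\nu$-domain is coarser, while the spacing in the warped domain stays constant across bands.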

5.3.2 Band Edge Problem 

The discussion in Section 5.3.1 shows that the sampling rate in the $\nu$-domain changes
abruptly at the band edges. Hence, though the sampling is uniform within a given
band, it is non-uniform over the whole $\nu$-domain. The result is the loss of spectral
samples at the band edges. To avoid this loss of spectral samples, we carried out the
following transition band analysis. Eq. (5.15) shows that the warped spectra in the
$i^{th}$ frequency region are shifted versions of one another, the shift being frequency
dependent, as it is a function of $\beta_i$. It is $\beta_i$ that determines the spacing of samples
in the $\nu$-domain, and its discontinuity at the band edge results in the loss of spectral
samples. One way to avoid this problem is to make $\beta_p$ change gradually to $\beta_{p+1}$,
$p = 1, 2, \cdots, N-1$, over a region of $K$ samples, i.e., we need

$$\frac{\beta_p}{\Delta\nu_p} = \frac{\beta_p + \Delta\beta}{\Delta\nu_p + \delta_{p,1}} = \cdots = \frac{\beta_p + k\Delta\beta}{\Delta\nu_p + \delta_{p,k}} = \cdots = \frac{\beta_p + K\Delta\beta}{\Delta\nu_p + \delta_{p,K}} = \frac{\beta_{p+1}}{\Delta\nu_{p+1}} \tag{5.22}$$

where $\Delta\beta$ is a factor which provides a transition in the values of $\beta$ across the
adjacent regions, and $K$ is the number of frequency points over which $\beta_p$ gradually changes
to $\beta_{p+1}$, defined as $K = (\beta_{p+1} - \beta_p) / \Delta\beta$, with $p = 1, 2, \cdots, N-1$ and $k = 1, 2, \cdots, K$.
$\{\delta_{p,k},\ k = 1, 2, \cdots, K\}$ are the factors to be computed, which provide the gradual
change in the sampling intervals across two adjacent frequency regions, thus avoiding
the loss of spectral samples at the band edges. Hence, given $\beta_p$, $\beta_{p+1}$ and $\Delta\beta$, the
factors $\{\delta_{p,k},\ k = 1, 2, \cdots, K\}$ can be determined from Eq. (5.22). We consider
$L$ points on either side of the band edge over which $\beta_p$ changes to $\beta_{p+1}$, thus
amounting to a total of $K$ points, where $L$ is given as

$$L = \begin{cases} \dfrac{K-1}{2}, & K \text{ is odd} \\[4pt] \dfrac{K}{2} - 1, & K \text{ is even} \end{cases} \tag{5.23}$$

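From Eq. (5.16), it is clear that $\Delta\nu_i \propto \beta_i$. Hence we have the following cases.

1. $\beta_{p+1} < \beta_p$: $\Delta\beta < 0$ and $\{\delta_{p,k},\ k = 1, 2, \cdots, K\}$ forms a decreasing sequence.

2. $\beta_{p+1} = \beta_p$: $\Delta\beta = 0$ and $\{\delta_{p,k} = \delta_{p,k+1},\ k = 1, 2, \cdots, K\}$.

3. $\beta_{p+1} > \beta_p$: $\Delta\beta > 0$ and $\{\delta_{p,k},\ k = 1, 2, \cdots, K\}$ forms an increasing sequence.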


Note that the above analysis for overriding the band edge effects may not be the
optimum method. In our case, $\beta_p$ varies linearly within the transition band to $\beta_{p+1}$.
Other forms of variation of $\beta_p$ within the transition band can be tried out.
The smaller the value of $|\Delta\beta|$, the smoother the transition from $\beta_p$ to $\beta_{p+1}$, as it is spread over
a larger number of points. Once the $\nu$-domain is discretized overriding the band edge
effects, the warping function can be computed as explained in Section 5.3.1 using
Eq. (5.21), except that the value of $\beta$ in the transition region between the $p^{th}$ and
$(p+1)^{th}$ frequency regions should be taken as $\beta_p + k\Delta\beta$, where $k = 1, 2, \cdots, K$.
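
The transition band computation can be illustrated with a short numerical sketch. It is not the code used for the experiments; it only assumes that the ratio of $\beta$ to the sample spacing is held constant through the transition, so that solving $\beta_p / \Delta\nu_p = (\beta_p + k\Delta\beta) / (\Delta\nu_p + \delta_{p,k})$ gives $\delta_{p,k} = k\,\Delta\beta\,\Delta\nu_p / \beta_p$. The function name `transition_increments`, the rounding of $K$ to an integer (with $\Delta\beta$ re-derived so that $\beta_{p+1}$ is reached exactly) and the spacing value 0.02 are our own assumptions.

```python
import numpy as np

def transition_increments(beta_p, beta_next, delta_nu_p, step):
    """Increments delta_{p,k} that change the nu-spacing gradually across a band edge.

    step is |delta_beta|. K is rounded to an integer and delta_beta is then
    re-derived so that beta reaches beta_next exactly (assumption of this sketch).
    """
    k_total = max(1, int(round(abs(beta_next - beta_p) / step)))
    d_beta = (beta_next - beta_p) / k_total
    k = np.arange(1, k_total + 1)
    # From beta_p / dnu_p = (beta_p + k * d_beta) / (dnu_p + delta_k):
    deltas = delta_nu_p * k * d_beta / beta_p
    return k_total, d_beta, deltas

# First band edge of the PnB case: beta changes from 2.13 to 1.22, |delta_beta| = 0.0275.
k_total, d_beta, deltas = transition_increments(2.13, 1.22, delta_nu_p=0.02, step=0.0275)
final_spacing = 0.02 + deltas[-1]   # spacing reached at the end of the transition
print(k_total, d_beta, final_spacing)
```

Since $\beta_{p+1} < \beta_p$ in this example, the increments form a decreasing sequence, and the spacing at the end of the transition equals $\Delta\nu_p \beta_{p+1} / \beta_p$.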


5.3.3 Experimental Determination of Warping Parameters 

The warping parameters $\alpha_{AB}$ and $\beta_i$ were computed experimentally from the vowel
data of the PnB and HiL databases. We chose $N = 5$, thus obtaining 5
logarithmically equi-spaced frequency regions. The reason for choosing $N$ to be 5 is
explained in Section 5.4. Table 5.1 shows the frequency regions of interest for the PnB and
HiL databases. While estimating $\alpha_{AB}^{(i)}$, only two speakers were considered at a time,
and only those pairs of formants that lie within the same frequency region were used.
For example, for each pair of speakers, A and B, we computed the ratio of formants
in the $i^{th}$ frequency region as


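$$\alpha_{AB}^{(i,j,k)} = \frac{F_{jB}^{(k)}}{F_{jA}^{(k)}}, \quad \text{if } F_{jA}^{(k)}, F_{jB}^{(k)} \in [L_i, U_i) \tag{5.24}$$

where $F_{jA}^{(k)}$ and $F_{jB}^{(k)}$ are the $j^{th}$ formants of the $k^{th}$ vowel of speakers A and B respectively,
and both of them lie in the same frequency region. We computed $\alpha_{AB}^{(i,j,k)}$ for all
pairs of such formants that lie in the $i^{th}$ frequency region and obtained the average
scaling factor, $\alpha_{AB}^{(i)}$, as the average of $\alpha_{AB}^{(i,j,k)}$ in the $i^{th}$ region for a given pair of speakers,
A and B. The estimates of $\alpha_{AB}^{(i)}$ obtained were averaged over all speaker pairs to find $\alpha^{(i)}$, representing
the frequency dependent scaling factors. The $\beta_i$'s were estimated from the estimates of
$\alpha^{(i)}$ as

$$\beta_i = \begin{cases} \dfrac{\log(\alpha^{(i)})}{\log(\alpha^{(N)})}, & \text{for } 1 \le i \le N-1 \\[4pt] 1, & \text{for } i = N \end{cases} \tag{5.25}$$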

Since the higher formants are mostly affected by the length of the pharyngeal cavity,
uniform scaling holds there and hence we assumed $\beta_N = 1$, thus making $\alpha_{AB}$ the
ratio of formants in the $N^{th}$ frequency region.

                 PnB                                    HiL
  Band (Hz)      $\beta_i$   $\sigma_i$       Band (Hz)      $\beta_i$   $\sigma_i$
  [190,356)      2.13        0.13             [310,524)      1.50        0.03
  [356,667)      1.22        0.05             [524,893)      1.55        0.03
  [667,1249)     1.51        0.05             [893,1523)     1.46        0.03
  [1249,2339)    1.27        0.04             [1523,2598)    1.40        0.02
  [2339,4381)    1.00        0.00             [2598,4431)    1.00        0.00

Table 5.1: Average estimates of $\beta_i$ in 5 logarithmically equi-spaced frequency regions.
$\sigma_i$ denotes the standard deviation of $\beta_i$ for the $i^{th}$ frequency band. Here, $1 \le i \le 5$.
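
The estimation of the $\beta_i$'s from band-wise formant ratios can be sketched for a single speaker pair as follows. This is a minimal illustration on synthetic data, not the procedure run on the PnB and HiL databases: the function name `estimate_betas`, the synthetic formant values and the scale factor 1.05 are hypothetical, and the averaging over many speaker pairs is omitted. The sketch takes $\beta_i$ as the ratio of the log of the average formant ratio in band $i$ to that in the highest band; because $\beta_i$ is a ratio of logarithms, inverting the direction of the formant ratio flips the sign of both logs and, for ratios close to one, has little effect on the estimate.

```python
import numpy as np

def estimate_betas(formants_a, formants_b, edges):
    """Band-wise beta estimates for one speaker pair from paired formants.

    Only formant pairs where both values fall in the same band are used.
    Every band is assumed to contain at least one such pair.
    """
    fa = np.asarray(formants_a, dtype=float)
    fb = np.asarray(formants_b, dtype=float)
    n_bands = len(edges) - 1
    log_alpha = np.empty(n_bands)
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 1]
        same_band = (fa >= lo) & (fa < hi) & (fb >= lo) & (fb < hi)
        log_alpha[i] = np.log(np.mean(fb[same_band] / fa[same_band]))
    # Normalize by the highest band, where uniform scaling is assumed (beta_N = 1).
    return log_alpha / log_alpha[-1]

# Synthetic check: build speaker B from speaker A with known beta values.
edges = np.geomspace(190.0, 4381.0, 6)
true_beta = np.array([2.13, 1.22, 1.51, 1.27, 1.00])
alpha = 1.05                                              # hypothetical scale factor
fa = np.concatenate([[lo * 1.2, lo * 1.5] for lo in edges[:-1]])
fb = fa * np.repeat(alpha ** true_beta, 2)                # band-wise scaling alpha**beta_i
est = estimate_betas(fa, fb, edges)
print(np.round(est, 3))
```

On this noise-free synthetic data the known values are recovered; with real formant data the estimates scatter, which is what the standard deviations in Table 5.1 quantify.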


  Database    W(f)
  PnB         -1203.48 + 47.57 (log(f))^
  HiL         -1323.47 + 49.70 (log(f))^

Table 5.2: Closed form equations for the discrete warping function, $W_i(f)$.
$W(f)$ is the closed form equation for the discrete warping function, $W_i(f)$. The curve
fits were obtained by using the TableCurve2D package.


5.4 Experiments and Results 

The experiments were carried out by overriding the band edge problem to obtain
the discrete warping function. Table 5.1 shows the estimates of $\beta_i$ along with their
standard deviations for both the PnB and HiL databases, obtained by averaging over
all speakers. Figure 5.1 shows the discrete warping function, $W_i(f)$, obtained with
and without transition band analysis for the PnB and HiL databases. Though the
warping functions look similar, in practice it is important to override the band edge
problem. We chose $|\Delta\beta| = 0.0275$ for the PnB database and $|\Delta\beta| = 0.010$ for the HiL
database. Depending on whether $(\beta_{p+1} - \beta_p)$ is positive or negative, $\Delta\beta$ was chosen to
be positive or negative for the $p^{th}$ band.

We chose $N = 5$ to model the non-linearity well while keeping the estimates reliable.
The smaller the value of $N$, the more coarsely the non-linearity is modelled. Larger
values of $N$ result in finer modelling of the non-linearity, but they also leave less data
available for the estimation of $\beta_i$ in each band, thus making the estimates less reliable.
Hence, a trade-off had to be made between finer modelling of the non-linearity and the
reliability of the estimates, resulting in the choice of $N = 5$.

Figure 5.1: Discrete warping function, $W_i(f)$, without transition band and with
transition band analysis.
Figure shows the discrete warping function, $W_i(f)$, determined with and without
transition band analysis for (a) PnB database and (b) HiL database. Though the curves
look very similar, the frequencies at which the $f$-domain is sampled are not exactly the
same.

Since the warping function obtained is discrete, we fitted
simple curves to it using TableCurve2D to obtain $W(f)$, which is more applicable to
continuous spectral patterns. The equations of $W(f)$ for the PnB and HiL databases are
shown in Table 5.2. Figure 5.2 shows the actual warping function and its closed form
approximation, $W(f)$, for the PnB and HiL databases. It shows that the curve fits are reliable
approximations of their respective originals.

Figure 5.2: Discrete warping function, $W_i(f)$, and its closed form approximation, $W(f)$.
Figure shows the discrete warping function, $W_i(f)$, and its closed form approximation,
$W(f)$, given in Table 5.2 for (a) PnB database and (b) HiL database. The curve fits
to $W_i(f)$ were the best fits obtained from TableCurve2D.

Figure 5.3 shows the plot of $W(f)$ for the PnB
and HiL databases along with the mel-warp and log-warp functions and the Stevens & Volkman [18] data
points. The mel-warp function defined in Eq. (2.17) is actually a curve fit to the Stevens
& Volkman data points, which were the actual mel frequency data points obtained
from psychoacoustic studies. The log-warp function corresponds to simple linear scaling
from the point of view of speaker normalization. The mel scale was derived from
psychoacoustic experiments that gave a perceptual measure of pitch. It is a hearing
derived scale that relates perceived frequency to the actual physical frequency. By
contrast, the frequency warping function is a speech derived scale that maps physical
frequency to an alternate domain, $\lambda$, such that in the warped domain the speaker
dependencies separate out as translation factors. Note the similarity between $W(f)$
and the mel-warp function at frequencies greater than 3500Hz, and between $W(f)$ and the
log-warp function at frequencies less than 500Hz. In between these frequencies, $W(f)$
lies between the mel-warp and log-warp functions, but closer to log-warp than to
mel-warp. It thus acts as a compromise between simple linear scaling (based
on speaker normalization) and the mel scale (based on hearing experiments). This
is an interesting result, as it suggests a relation between the hearing mechanism
and speech production.

Figure 5.3: Comparison of warping function, $W(f)$, log-warp, mel-warp functions and
Stevens & Volkman's actual mel data points.
Figure shows the warping functions for the PnB and HiL databases along with the log-warp
and mel-warp functions. Mel-warp is a function fitted to the actual mel data points
of Stevens & Volkman. It is interesting to note the similarity of these warping
functions, though they are derived from entirely different studies.

The degree of vowel normalization provided by $W(f)$ derived
from both the PnB and HiL databases, compared to the log-warp and mel-warp functions,
is discussed in Section 6.3.


5.5 Summary 

The basic theory of scale-invariant transformation was presented. A method for
incorporating non-linear scaling in such a paradigm was also discussed. The warping
function derived from the study was compared with the log-warp and mel-warp
functions: it is more log-like at frequencies less than 500Hz, more mel-like
at frequencies greater than 3500Hz, and acts as a compromise in between these
frequencies.




Chapter 6 

Comparison of Vowel 
Normalization Methods in Vowel 
Classification Performance 


In Chapter 4, the efficiency of different vowel normalization methods was discussed
in a more qualitative sense. The analytical measures defined in Chapter 4 give an
idea of how well the normalization is done by various vowel normalization schemes.
But our motivation to study and propose the vowel normalization methods was to
make them applicable to speaker independent speech recognition. Hence, the best way
to judge the efficiency of these normalization methods is to implement them
on a continuous-density HMM-based recognizer. All the normalization approaches
attempt to normalize the feature vector of the speech signal, with the intention of
reducing inter-speaker differences caused by vocal tract length variations. There are
two broad approaches to feature based speaker normalization: (1) the first approach
is to directly estimate the "gross scale factor, $\alpha$" either by the maximum likelihood (ML)
method [3, 4, 11, 16, 25] or by formant estimations (physiological motivations) from
the speech data [2, 26], and (2) the second type of system uses a suitable
scale-invariant transformation [10], so that there is no need for explicit $\alpha$ estimation.
Since all the vowel normalization methods that we discussed were based on the
formant data of the speakers, it is more logical to judge their normalization
performance by applying them on an HMM-based vowel recognizer rather than on a
continuous speech recognizer. In this thesis, we have implemented various vowel
normalization schemes on an HMM-based vowel recognizer with two variations, one
estimating $\alpha$ explicitly in the ML sense and the other using a scale-invariant
transformation.


6.1 Hidden Markov Model Based Speech Recognizer

An automatic speech recognizer is a system which allows a computer to recognize
the words spoken by a person. The Automatic Speech Recognition (ASR) problem is to
find the sequence of words corresponding to a given set of acoustic features. It involves two stages:
(1) feature extraction and (2) pattern recognition.

The feature extraction stage is necessary to reduce the dimensionality of the
problem and to get a parsimonious representation of the speech signal, where only
phonetically relevant information is retained and unwanted distortions are eliminated. In
our experiments with normalization on the vowel recognizer, we followed two
different methods of extracting the features, depending on the normalization procedure
used. The interested reader is referred to [8, 27] for the details of the feature
extraction stage. In brief, the standard Davis-Mermelstein [28] filterbank front-end was
used to derive Mel Frequency Cepstral Coefficients (MFCC) for the normalization
experiments which explicitly estimate the scale factor, $\alpha$. The normalization
experiments with scale-invariant transformation were carried out using features
computed from Weighted Overlapped Segment Averaging (WOSA) [29] analysis with
non-uniform DFT.

The pattern recognition problem for automatic speech recognition can be
solved in any of three paradigms: (1) vector quantization, (2) hidden Markov
modelling and (3) artificial neural networks. Speech recognition is associated with
many uncertainties due to different sources of variability (speaker, channel, noise, etc.).
Stochastic modelling is a flexible method of accounting for such variabilities. One of
the major advantages of using HMMs in the speech recognition problem is their ability
to provide a uniform framework for the stochastic representation of both acoustic and
lexical rules, along with other sources of knowledge. The reader is referred
to [27, 30] for more fundamental details on their usage in ASR.


6.2 Speaker Normalization on the Recognizer

As noted earlier, vocal tract size variations, which result in scaling
of the frequency spectrum of speech signals, account for a major portion of the
inter-speaker variations. Hence, it is intuitive to normalize the frequency spectrum
of each speaker, with proper estimation of the scaling factor. There exists a class
of systems which differ in the manner in which the scale factor is estimated. On
the other hand, there exists a class of systems which use a suitable scale-invariant
transformation, thus avoiding explicit scale factor estimation.

6.2.1 Recognizers With Explicit Scale Factor Estimation 

The basic idea in this class of recognizers is to estimate the optimal scaling factor,
$\alpha$, for every speaker in the training set in the maximum likelihood (ML) sense [3, 4].
The estimate is used to warp the utterances, thus building a normalized HMM. Similarly,
during recognition, $\alpha$ is estimated for every input speech utterance and is then
used to warp the speech. The decoding of the warped utterance is carried out on the
normalized HMM. Generally, the scale factor is computed for a speaker with respect
to some reference speaker. But in this approach, the role of the reference speaker is
played by a reference HMM. It is clear that the scaling factor estimation process
requires a pre-existing HMM. Hence, an iterative procedure is used to choose
the best scaling factor for each speaker and then build a speaker-independent model
using the warped training utterances, finally resulting in a speaker-normalized model.
The interested reader is referred to [3, 4] for full details about the class of
recognizers that estimate the scale factor in the ML sense.
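
The idea of choosing the scale factor by maximum likelihood can be illustrated with a toy grid search. This is only a sketch under strong simplifying assumptions and is not the procedure of [3, 4]: the reference HMM is replaced by a single diagonal Gaussian over log-formants, the warping is applied directly to three formant values instead of to the spectrum, and the names `ml_scale_factor`, `mean_log`, `std_log` and all numerical values are hypothetical.

```python
import numpy as np

def ml_scale_factor(formants_hz, model_mean_log, model_std_log, grid):
    """Return the alpha on the grid that maximizes a Gaussian log-likelihood.

    Constant terms of the log-likelihood are dropped since they do not
    depend on alpha.
    """
    f = np.asarray(formants_hz, dtype=float)
    scores = []
    for alpha in grid:
        z = (np.log(alpha * f) - model_mean_log) / model_std_log
        scores.append(-0.5 * np.sum(z ** 2))
    return grid[int(np.argmax(scores))]

# Hypothetical reference model for one vowel: log-formant means and spreads.
mean_log = np.log(np.array([730.0, 1090.0, 2440.0]))
std_log = np.array([0.05, 0.05, 0.05])
grid = np.linspace(0.80, 1.20, 41)           # candidate scale factors, step 0.01

# A test speaker whose formants are uniformly lower than the reference.
test_formants = np.exp(mean_log) / 1.10
alpha_hat = ml_scale_factor(test_formants, mean_log, std_log, grid)
print(alpha_hat)
```

In a real recognizer the likelihood is evaluated by the HMM on the warped feature vectors, and the search is repeated over training iterations as described above.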

6.2.2 Recognizers With Scale Invariance 

In this approach, a scale-invariant transformation is applied to two (linearly or
non-linearly) scaled spectra, thus transforming them to look similar. The main advantage of this
approach is that there is no need to estimate the scale factor for every utterance,
unlike the method explained in Section 6.2.1. The basic idea is to warp a pair of
mutually scaled spectra such that in the warped domain they appear as shifted versions
of one another. The magnitudes of the Fourier transforms of these shifted functions
are identical, resulting in identical feature sets. This method allows the freedom to incorporate
non-linear scaling as a warping function.
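
The shift-invariance property underlying this approach can be checked numerically for the simplest case of uniform scaling with a log-warp. The sketch below is an illustration only: the Gaussian-shaped envelope `envelope_b`, the grid limits and the shift of 7 samples are hypothetical, and the envelope is chosen to decay to practically zero at the ends of the grid so that the shift does not push energy off the grid.

```python
import numpy as np

n = 256
nu = np.linspace(np.log(100.0), np.log(8000.0), n)   # uniform grid in nu = log(f)
d_nu = nu[1] - nu[0]

def envelope_b(v):
    """Hypothetical smooth spectral envelope of speaker B in the log-frequency domain."""
    return np.exp(-0.5 * ((v - np.log(900.0)) / 0.2) ** 2)

shift = 7                                 # shift in samples of the nu grid
alpha = np.exp(shift * d_nu)              # the uniform scale factor it corresponds to
s_b = envelope_b(nu)
s_a = envelope_b(nu + np.log(alpha))      # S_A(nu) = S_B(nu + log(alpha))

# The magnitude of the Fourier transform discards the shift.
feat_a = np.abs(np.fft.fft(s_a))
feat_b = np.abs(np.fft.fft(s_b))
print(alpha, np.max(np.abs(feat_a - feat_b)))
```

The two warped envelopes differ sample by sample, but their Fourier magnitudes agree to within numerical precision, so the speaker dependent scale factor does not appear in the features.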

Earlier experiments [8] with a non-linear scaling function (on a continuous
digit recognizer) in this paradigm have shown inferior results compared to linear
scaling incorporated in the paradigm explained in Section 6.2.1. Our basic hypothesis
is that non-linear scaling should do better than linear scaling, as there exists a
non-linear relationship between the formant frequencies of speakers. Two reasons
can be thought of to explain the nature of these results. One reason may be the
application of a non-linear scaling function derived from vowel data on a continuous
digit recognizer, where the consonants may not be normalized. The other reason is
that the Fourier transforms of the warped spectra are complex quantities. Taking
the magnitude of the Fourier transforms of two warped spectra removes not only
the linear phase, which is basically a speaker dependent term, but also the phase of
the Fourier transform of the warped spectrum itself. With the reference phase lost,
it is difficult to reconstruct the phase, which plays its role in modelling the speech
unit. Hence, the scale-invariant transformation approach cannot be applied as it is
on the recognizer.

One smart but laborious way is to estimate the shift factors in the warped
domain in the ML sense [31]. Since suitable frequency warping of the uniformly or
non-uniformly scaled spectra generates shifted versions in the warped domain, the shift
factor can be explicitly estimated in the ML sense, which finally reduces to a
class of recognizers similar to those with explicit scale factor estimation. But computationally, this is
more efficient than the other class, as only shifted versions of the base feature set
have to be computed instead of warped spectra for different warping
factors.


6.3 Experiments and Results 

This section presents an account of the experiments that were carried out to
investigate the effectiveness of various speaker normalization procedures in the context of
vowels. Speech recognition accuracy was used as the performance measure for speaker
normalization.

A Tasks and Database


The data for our vowel recognition experiments was collected from dialect region
dr2 of the TIMIT database, consisting of 71 male speakers and 31 female speakers. The
size of the vocabulary was 12 vowels: /AA/, /AE/, /AH/, /AO/, /AW/, /EH/,
/ER/, /EY/, /IH/, /IY/, /UH/, /UW/. In order to avoid dialect mismatch, we
took both the training and testing data from dialect region dr2 itself. The training and
testing sets were separated into male and female data in order to obtain compact
models for both male and female speakers while training the HMMs. The training set consisted of
3981 utterances from 53 male speakers and 1770 utterances from 23 female speakers.
The testing set was never exposed to the models during any part of training. The testing set
consisted of 1381 utterances from 18 males and 637 utterances from 8 females. These
utterances covered all the vowels. After decoding, the number of deletion
errors (D), insertion errors (I) and substitution errors (S) were calculated. Percent
accuracy, A, defined as

$$A = \frac{N - D - S - I}{N} \times 100\% \tag{6.1}$$

where $N$ is the total number of labels in the reference transcriptions,
was used to evaluate the performance of various normalization procedures.
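
As a small worked example of the accuracy measure, the sketch below computes percent accuracy from the three error counts. It assumes the commonly used definition in which the deletion, substitution and insertion counts are subtracted from the total number of reference labels; the function name `percent_accuracy` and the counts in the example are hypothetical and are not taken from the experiments reported here.

```python
def percent_accuracy(n_labels, deletions, substitutions, insertions):
    """Percent accuracy: (N - D - S - I) / N * 100."""
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    return 100.0 * (n_labels - deletions - substitutions - insertions) / n_labels

# Hypothetical counts for illustration only.
acc = percent_accuracy(n_labels=1000, deletions=20, substitutions=300, insertions=30)
print(acc)  # 65.0
```

Unlike a plain percent-correct figure, this measure also penalizes insertions, so it can become negative when the recognizer inserts many spurious labels.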


B Vowel Recognizer 

The experiments were carried out on an HMM-based vowel recognition system, using
HTK [32]. We conducted our experiments on two different kinds of vowel recognizers:
(1) a frame-based vowel recognizer and (2) an utterance-based vowel recognizer. In the
case of the frame-based vowel recognizer, each vowel was modelled by a continuous
density left-to-right HMM with a single active state. The observation densities were mixtures of
five multivariate Gaussian distributions with diagonal covariance matrices. The
basic idea behind this recognizer was to decode each frame separately instead of
decoding the whole utterance. The utterance-based vowel recognizer was developed by
modelling each vowel by a continuous density left-to-right HMM with three active states,
the observation densities being mixtures of 2 multivariate Gaussian distributions
with diagonal covariance matrices.

The TIMIT data, being recorded with a microphone, is sampled at 16kHz.
Speech signals were sectioned into frames of 20ms with
an overlap of 10ms. Pre-emphasis by a first-order backward difference with
factor 0.97 was carried out, followed by Hamming windowing. A 512 point FFT was
taken on each data frame for computing the MFCC feature set, using a 26 channel
mel-filterbank. In WOSA based methods, each data frame (without Hamming
windowing) was sectioned into Hamming windowed subframes of 128 samples with
an overlap of 90 samples. A smooth spectral estimate was obtained from the 255 point
autocorrelation function by computing a 64-point DFT, the DFT being computed at
frequencies determined by the warping function used. Thirteen dimensional feature
vectors were used: normalized energy and the cepstra c[1] to c[12], which were derived
depending on the type of front-end signal processor used in implementing the warping
function.

  A           M-M      M-F      F-M      F-F
  MFCC        45.65    37.08    32.71    47.64
  Log-warp    46.91    35.85    34.32    44.96
  FDS-PnB     46.91    35.85    34.32    44.96
  FDS-HiL     47.51    34.50    35.34    48.54
  MBN-PnB     47.17    36.52    34.00    47.98
  MBN-HiL     47.17    36.52    34.00    47.98
  Mel-warp    46.91    35.85    34.32    44.96
  PWB-PnB     47.51    34.50    35.34    48.54
  PWB-HiL     47.51    34.50    35.34    48.54

Table 6.1: Recognition performance of various vowel normalization methods on a
frame-based vowel recognizer.
Log-warp refers to Eq. (3.16). FDS-PnB and FDS-HiL are shown in Table 2.3,
MBN-PnB and MBN-HiL in Eq. (3.8), Mel-warp in Eq. (2.17) and PWB-PnB,
PWB-HiL in Table 5.2. M denotes male gender and F denotes female gender. In
the notation A-B, A denotes the gender of the training data set and B denotes the
gender of the test data set.
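
The first steps of the MFCC front-end described above (pre-emphasis, framing, Hamming windowing and the 512 point FFT) can be sketched as follows. This is a simplified illustration and not the HTK configuration used in the experiments: the function name `frame_spectra` is our own, pre-emphasis is applied to the whole signal before framing for brevity, and the mel-filterbank and cepstral steps are left out.

```python
import numpy as np

def frame_spectra(signal, fs=16000, frame_ms=20, hop_ms=10, preemph=0.97, n_fft=512):
    """Pre-emphasize, frame, Hamming-window and return magnitude spectra.

    The signal is assumed to be at least one frame long.
    """
    x = np.asarray(signal, dtype=float)
    # First-order backward difference: y[n] = x[n] - preemph * x[n-1].
    y = np.append(x[0], x[1:] - preemph * x[:-1])
    frame_len = fs * frame_ms // 1000         # 320 samples at 16 kHz
    hop = fs * hop_ms // 1000                 # 160 samples at 16 kHz
    n_frames = 1 + (len(y) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([y[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Each 320-sample frame is zero-padded to n_fft points by rfft.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

rng = np.random.default_rng(0)
spectra = frame_spectra(rng.standard_normal(16000))   # one second of noise
print(spectra.shape)
```

Each row of `spectra` holds the 257 non-negative-frequency bins of one frame; a 26 channel mel-filterbank followed by a cosine transform would turn these into MFCCs.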

Warping Function Implementation 

Consider the function $\nu = z(f)$, where $z(\cdot)$ is the frequency warping function. Since
$z(\cdot)$ warps the scaled spectra to appear as shifted versions in the $\nu$-domain, the $\nu$-domain
is a linear domain. Thus $f = z^{-1}(\nu)$ gives the discrete frequencies at which the
spectra are to be sampled in the $f$-domain. While computing WOSA based features,
the DFT is computed at the required number of frequencies defined by $f = z^{-1}(\nu)$. An
efficient implementation [33] of the non-uniform DFT can be used to compute
WOSA based features.
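
Evaluating the DFT at warped frequencies can be sketched with a direct summation. This is a minimal illustration and not the efficient implementation of [33]: the function name `nudft` is our own, the log-warp is used as the example warping function (uniform $\nu$, then $f = e^{\nu}$), the frequency limits are arbitrary, and a random sequence stands in for the 255 point autocorrelation estimate, with no attempt to centre the lags.

```python
import numpy as np

def nudft(x, freqs_hz, fs):
    """Direct DFT of x evaluated at arbitrary frequencies (cost grows as N times M)."""
    n = np.arange(len(x))
    kernel = np.exp(-2j * np.pi * np.outer(freqs_hz, n) / fs)
    return kernel @ np.asarray(x, dtype=float)

fs = 16000.0
# Log-warp as an illustrative choice: uniform samples in nu, mapped back to Hz.
nu = np.linspace(np.log(100.0), np.log(7000.0), 64)
freqs = np.exp(nu)

rng = np.random.default_rng(0)
x = rng.standard_normal(255)            # stand-in for an autocorrelation estimate
spectrum = nudft(x, freqs, fs)          # 64 samples on the warped frequency axis

# Sanity check: on the uniform DFT grid the direct sum reproduces the FFT.
uniform = np.arange(len(x)) * fs / len(x)
matches_fft = np.allclose(nudft(x, uniform, fs), np.fft.fft(x))
print(spectrum.shape, matches_fft)
```

Any other warping function can be substituted by replacing the mapping from `nu` to `freqs` with the inverse of that function.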

Frame-based Vowel Recognizer 

In our experiments with the frame-based vowel recognizer, we considered only the
centre frames of the vowel to model it. This is because the vowel
is steadier around the centre region than at the starting and ending instants,
which are affected by articulation. In a given utterance of a vowel, the first and
last frames were excluded and the remaining data was used in developing the
recognizer.

Utterance-based Vowel Recognizer 

In our experiments with utterance-based vowel recognizer, the whole utterance of 
the vowel was considered to model the vowel by three active states, the observation 
density at each state being mixtures of two multivariate Gaussian densities with 
diagonal covariance matrices. 

C Vowel Recognition Performance 

In order to study the effect of warping functions in speaker normalization, we
generated gender dependent models by using only the training set data of male speakers or
female speakers. The testing was carried out both within gender and across genders.
The experiments on the frame-based vowel recognizer were conducted to study the
baseline performance of the recognizer. Table 6.1 shows the baseline performances for
different warping functions for the frame-based vowel recognizer. The normalization
experiments were not conducted for the frame-based recognizer. Table 6.1 shows that
the log-warp function, MBN and FDS are consistently better than the other
normalization procedures, which confirms our result explained in Chapter 4. The experiments
on the utterance-based vowel recognizer were conducted to examine the amount of
normalization done by various normalization procedures. Table 6.2 shows the
recognition performance before and after normalization for different vowel normalization
schemes for the utterance-based vowel recognizer. Figure 6.1 depicts the percentage improvement in the
recognition accuracy with normalization over the baseline for various vowel
normalization schemes for cross-gender cases. It is in the cross-gender cases that the
normalization performance can be judged more clearly. In the same-gender cases, the
normalization was not very effective. It can be clearly seen from Figure 6.1 that
MBN does the best normalization for the F-M case, followed by the log-warp function
and FDS. This again coincides with the result which we mentioned in Chapter 4 by
using some analytical measures. For the M-F case, MFCC (which actually
implements linear scaling) does the best normalization, followed by MBN and the log-warp
function.

  (A_b, A_n)   M-M              M-F              F-M              F-F
  MFCC         (60.83, 59.72)   (44.54, 52.53)   (41.76, 47.89)   (59.71, 58.73)
  Log-warp     (58.98, 56.84)   (47.96, 53.83)   (40.95, 51.15)   (57.59, 60.36)
  FDS-PnB      (58.98, 56.84)   (47.96, 53.83)   (40.95, 51.15)   (57.59, 60.36)
  FDS-HiL      (59.28, 56.91)   (46.98, 50.08)   (41.46, 49.59)   (59.54, 59.71)
  MBN-PnB      (58.68, 57.28)   (47.47, 53.34)   (41.02, 51.37)   (57.59, 58.73)
  MBN-HiL      (58.68, 57.06)   (47.31, 53.18)   (40.87, 51.81)   (56.93, 59.05)
  Mel-warp     (58.61, 56.91)   (46.82, 50.57)   (39.54, 46.27)   (59.87, 58.08)
  PWB-PnB      (59.28, 56.91)   (46.98, 50.08)   (41.46, 49.59)   (59.54, 59.71)
  PWB-HiL      (59.28, 56.91)   (46.98, 50.08)   (41.46, 49.59)   (59.54, 59.71)

Table 6.2: Recognition performance of various vowel normalization methods on an
utterance-based vowel recognizer before and after normalization.
Table shows various warping functions which are explained in Table 6.1. The
notation (A_b, A_n) shows that A_b and A_n are the recognition accuracies of the baseline (without
normalization) and with normalization respectively.
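
The cross-gender comparison can be reproduced from the tabulated accuracies of Table 6.2. The sketch below assumes that the percentage improvement is the gain over the baseline relative to the baseline accuracy; the dictionary layout and the name `percent_improvement` are our own. Here M-F denotes training on male and testing on female speakers, and F-M the reverse.

```python
# Cross-gender (baseline, normalized) accuracies copied from Table 6.2.
table_6_2 = {
    "MFCC":     {"M-F": (44.54, 52.53), "F-M": (41.76, 47.89)},
    "Log-warp": {"M-F": (47.96, 53.83), "F-M": (40.95, 51.15)},
    "FDS-PnB":  {"M-F": (47.96, 53.83), "F-M": (40.95, 51.15)},
    "FDS-HiL":  {"M-F": (46.98, 50.08), "F-M": (41.46, 49.59)},
    "MBN-PnB":  {"M-F": (47.47, 53.34), "F-M": (41.02, 51.37)},
    "MBN-HiL":  {"M-F": (47.31, 53.18), "F-M": (40.87, 51.81)},
    "Mel-warp": {"M-F": (46.82, 50.57), "F-M": (39.54, 46.27)},
    "PWB-PnB":  {"M-F": (46.98, 50.08), "F-M": (41.46, 49.59)},
    "PWB-HiL":  {"M-F": (46.98, 50.08), "F-M": (41.46, 49.59)},
}

def percent_improvement(baseline, normalized):
    """Relative gain over the baseline, in percent (assumed definition)."""
    return 100.0 * (normalized - baseline) / baseline

gains = {
    method: {case: percent_improvement(*pair) for case, pair in cases.items()}
    for method, cases in table_6_2.items()
}
best_fm = max(gains, key=lambda m: gains[m]["F-M"])
best_mf = max(gains, key=lambda m: gains[m]["M-F"])

for method, g in gains.items():
    print(f"{method:9s} M-F {g['M-F']:6.2f}  F-M {g['F-M']:6.2f}")
print("best F-M:", best_fm, " best M-F:", best_mf)
```

Under this definition the largest F-M gain is obtained for an MBN variant and the largest M-F gain for MFCC, in agreement with the discussion above.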


Figure 6.1: Percentage improvement in the recognition accuracy after normalization
for various vowel normalization methods on an utterance-based vowel recognizer.
Figure shows the improvement in the recognition accuracy after normalization with
respect to the baseline performance for the M-F and F-M cases. The legend lists the
methods MFCC, Log-warp, FDS-PnB, FDS-HiL, MBN-PnB, MBN-HiL, Mel-warp, PWB-PnB and
PWB-HiL, with the horizontal axis numbered 1 to 9. The percentage
improvement for (A_b, A_n) is calculated as $\frac{A_n - A_b}{A_b} \times 100$, where A_b and A_n are the same
as explained in Table 6.2.


6.4 Summary

A brief overview of the HMM-based vowel recognizer was presented. The
implementation of two different vowel recognizers for the vowel recognition and normalization tasks
was discussed. The efficiency of vowel normalization methods was studied by
applying them to vowel recognition and normalization tasks. MBN, FDS and the log-warp
function performed better than the other normalization procedures on both the frame-based
and utterance-based vowel recognizers, thus confirming our previous result obtained
through analytical measures.




Chapter 7 
Conclusions 


In this thesis, we have studied the nature of the relationships between formant
frequencies of speakers of different age and gender using vowel formant data from the Peterson
& Barney and Hillenbrand et al. databases. Based on this study, a model based
non-uniform vowel normalization method is proposed with the aim of achieving robustness
to speaker variations in speaker independent speech recognition. We have also made
a comprehensive study of the frequency dependent scaling method and the scale-invariant
transformation method and have incorporated them into state-of-the-art recognizers.

Different measures, both objective (viz. residual variance, F-ratio) and 
subjective (viz. scatter plots) are used in studying the performance of vowel nor- 
malization procedures. The best normalization performance in terms of F-ratio and 
residual variance is obtained for model based normalization procedure. The pro- 
posed method gives substantial improvement over simple linear scaling in reducing 
the variance of vowel clusters. Scatter plots also show similar kind of performance 
for model based method when compared to other normalization procedures. 

The proposed model based normalization method was incorporated into an
HMM-based vowel recognizer. From the recognition performance results, it can be
inferred that the proposed method does the best normalization for cross-gender cases
when compared to the other normalization methods.

The frequency-warping necessary to do non-uniform vowel normalization
using the proposed model based method turns out to be similar to log-warp, whereas
for the frequency dependent scaling and scale-invariant transformation methods, it turns
out to be a compromise between log-warp and mel-warp, closer to log-warp at
frequencies less than 500Hz and closer to mel-warp at frequencies greater than 3500Hz.
One reason for this kind of complementary behaviour of the warping functions,
which are derived using varied methods but from the same vowel formant data, may
be the gross approximation to a linear scaling relationship between speakers when
the entire frequency range is considered. Since the analysis in the frequency dependent
scaling and scale-invariant transformation methods is restricted to frequency bands,
instead of the whole frequency range, the non-linearities between speakers are better
modelled than in the case of the proposed model based method, where the analysis is
done over the entire frequency range.

Future Work 

As already noted, further studies need to be done to understand the
contradicting behaviour of the warping functions derived from different vowel
normalization methods. Detailed analysis can be done by studying the speaker
relationships over different frequency bands. The warping functions for the different methods,
being derived from vowel formant data, were studied only on HMM-based vowel
recognizers. It may be worth implementing these methods on a continuous
speech recognizer, which would give an idea of the way non-vowels are affected by
normalization. If the normalization performance on a speech recognizer turns out
to be poor, there opens a wide area of research on normalizing the non-vowel sounds.